Scenario: A service is hosted in two regions. Each region is fronted by either an ALB or an ELB with an attached Autoscale group that uses the LB healthcheck to determine instance health. A Route53 configuration balances traffic between the two regions. Route53 ‘Evaluate Target Health’ is set to yes and no healthcheck is attached.
Under the ELB, if the backend application fails in a region, the ELB will trigger termination of the application nodes. Route53 will consider the region unhealthy if all the backends are sick or unregistered and fail over to the remaining region.
Under the ALB, the same occurs except Route53 does not consider an empty ALB as unhealthy and will continue to send traffic to a region with no registered backends.
This is to a certain extent understandable, as ALBs allow attaching multiple target groups and it’s not immediately obvious what to do when there is a mix of statuses. I suspect the common case is that most ALBs have exactly one target group attached, though, and that group could be used as the status. Alternatively, Route53 could allow binding to a specific target group.
The current workaround is to use a Route53 Healthcheck (at an additional $1+ per month per check) to have Route53 perform an application healthcheck against each origin.
The CentOS 7 AMI in Amazon comes with Cloud-Init (cloud-init-0.7.5-10.el7.centos.1.x86_64). This is quite handy as it assists in automating several bootup tasks. One of these tasks is to install and bootstrap Chef. Unfortunately, when SELinux is installed the Chef handler will fail:
[CLOUDINIT] util.py[DEBUG]: Restoring selinux mode for /var/lib (recursive=True)
[CLOUDINIT] util.py[DEBUG]: Running chef (<module 'cloudinit.config.cc_chef' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_chef.py'>) failed
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/cloudinit/stages.py", line 658, in _run_modules
cc.run(run_name, mod.handle, func_args, freq=freq)
File "/usr/lib/python2.7/site-packages/cloudinit/cloud.py", line 63, in run
return self._runners.run(name, functor, args, freq, clear_on_fail)
File "/usr/lib/python2.7/site-packages/cloudinit/helpers.py", line 197, in run
results = functor(*args)
File "/usr/lib/python2.7/site-packages/cloudinit/config/cc_chef.py", line 54, in handle
File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 1291, in ensure_dir
File "/usr/lib/python2.7/site-packages/cloudinit/util.py", line 167, in __exit__
File "/usr/lib64/python2.7/site-packages/selinux/__init__.py", line 95, in restorecon
for fname in fnames]), None)
File "/usr/lib64/python2.7/posixpath.py", line 246, in walk
walk(name, func, arg)
File "/usr/lib64/python2.7/posixpath.py", line 238, in walk
func(arg, top, names)
File "/usr/lib64/python2.7/site-packages/selinux/__init__.py", line 95, in <lambda>
for fname in fnames]), None)
File "/usr/lib64/python2.7/site-packages/selinux/__init__.py", line 85, in restorecon
status, context = matchpathcon(path, mode)
OSError: [Errno 2] No such file or directory
All nodes in EC2 can fetch their Instance Identity Documents. This returns an AWS-signed block containing data about the instance that requested it. This data is only available to the instance that fetched it, so if it turns around and presents it to some other service you can be confident it originated from the machine in question.
This is useful for secrets-enrollment processes where you want a node to be able to attest to its own identity in a secure fashion. Your enrollment mechanism can ensure that the node is ‘young’ enough to be making the request, and that the enrollment has never occurred before. Your enrollment tool can also look up other data about the instance (autoscale group, CloudFormation stack, etc.) to determine what privileges it should be granted.
In this manner you can pivot on the security features AWS offers for identifying a particular node to another tool.
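The ‘young enough’ and ‘never enrolled before’ checks might look something like the sketch below. It assumes the document's signature has already been verified by some other step (not shown), and the 15-minute window, the function name and the in-memory set of enrolled instances are all hypothetical choices of mine. The instanceId and pendingTime fields are part of the identity document itself.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical policy: how long after launch a node may still enroll.
MAX_AGE = timedelta(minutes=15)


def may_enroll(document_json, enrolled_ids, now=None):
    """Return True if the instance described by the document may enroll.

    Assumes the document's signature was verified before this is called.
    """
    doc = json.loads(document_json)
    instance_id = doc["instanceId"]
    # pendingTime is a UTC timestamp such as 2016-03-01T12:00:00Z
    launched = datetime.strptime(doc["pendingTime"], "%Y-%m-%dT%H:%M:%SZ")
    launched = launched.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)

    if instance_id in enrolled_ids:
        return False  # this instance has already enrolled once
    if now - launched > MAX_AGE:
        return False  # too old to plausibly be a first boot
    enrolled_ids.add(instance_id)  # remember it so a replay is refused
    return True
```

A real enrollment service would keep the enrolled set somewhere durable rather than in memory, and would do the autoscale group / CloudFormation lookups before deciding which privileges to hand out.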
I’ve been looking for a way to run Consul ‘standalone’ on a host and let multiple ECS containers connect to it. I was hoping to find a macro I could put in a container definition but such does not yet exist. Instead I realized that I can have my Docker instance query the instance metadata service and get this information. It is not quite as elegant docker-wise but it should work until something better comes along.
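As a rough sketch of that lookup: the container asks the instance metadata service for the host's private IP and builds the Consul address from it. This assumes the container can reach 169.254.169.254 and that the host answers unauthenticated (IMDSv1-style) requests. The function names are mine, and 8500 is Consul's default HTTP port.

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def host_ip(fetch=None, timeout=2):
    """Return the private IPv4 address of the EC2 host.

    `fetch` can be swapped out for testing away from EC2.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("ascii")
    return fetch(METADATA_URL).strip()


def consul_http_addr(ip, port=8500):
    """Build the HTTP address of a Consul agent listening on the host."""
    return "http://{}:{}".format(ip, port)
```

Inside the container's entrypoint this would be something like consul_http_addr(host_ip()), with the result exported as CONSUL_HTTP_ADDR before the application starts.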
While pondering running a full ‘ride-the-wave’ auto-scaling solution in AWS, I looked closely at my Chef installation. The environment is very Chef-heavy, with a fairly generic AMI that runs a slew of Chef recipes to bring it up to the needs of that particular role. The nodes are in Autoscale groups and get their role designation from the Launch Configuration.
On average a node invoked approximately 55 recipes (as recorded by seen_recipes in the audit cookbook). Several of those recipes bring in resources from (authenticated) remote locations that have very good availability but are not in my direct control. Ignoring the remote-based recipes there is still a significant number of moving parts that can be disrupted unexpectedly, such as by other cookbook / role / KV store changes. This is acceptable-if-not-ideal when nodes are generally brought into being under the supervision of an operator who can resolve any issues, or for when the odd 1-of-x00 pooled nodes dies and is automatically replaced. This risk is manageable when the environment is perpetually scaled for peak traffic.
However, when critical nodes are riding the wave of capacity then the chance that something will eventually break during scale-up and cause the ‘wave’ to swamp the application becomes 100%. That requires an operator to fix the problem under a significant time crunch as the application is overwhelmed by traffic, which is hardly a recipe for success. The more likely scenario is breakage at some odd hour of the morning as users wake up, with the application failing before an operator can intervene to keep it alive.
I looked at my Chef construction and realized it was less Infrastructure As Code (IaC) and more like Compile In Production (CIP).
This tells me that ECS / Docker enforced the memory limit for the container, and the out-of-memory-killer killed off the contained processes. Raising the ECS memory limit for this process resolved the issue.
Carl’s Jr has a nifty nutritional calculator / order planner at http://www.carlsjr.com/menu/nutritional_calculator. It lets you fully customize your meal, then lets you print or email your order to yourself with all the magic words to say to get your meal as planned (subbing cheese, extra / 2x / no onion, etc).
Tonight I used this to pre-assemble a highly customized meal for my family. I triggered it to send me an email (easier to read on my phone at the order window) and anxiously awaited.
No email was received.
I host my own email and use http://rollernet.us for my public incoming MX relays. They are nifty as they have a ton of highly configurable anti-spam features that I can apply ‘at the edge’, which lets my actual mailserver run much leaner since SpamAssassin et al. are resource intensive.
Rollernet logs stated:
Connection from 18.104.22.168 rejected by mail.rollernet.us
Reason: SPF fail (Mechanism -all matched)
Oh ho! I temporarily disabled SPF checking (and greylisting) and sent another meal through, and the email header said:
Received-SPF: fail (carlsjr.com: Sender is not authorized by default to use 'firstname.lastname@example.org' in 'mfrom' identity (mechanism '-all' matched)) receiver=mail2.rollernet.us; identity=mailfrom; envelope-from="email@example.com";
This tells me they use outlook.com for their internal email, and only allow a few subnets to originate mail from them. The source IP I got was not one of them. 22.214.171.124 resolves to ec2-54-236-168-30.compute-1.amazonaws.com. www.carlsjr.com resolves to what appears to be a CloudFormation-based AWS environment:
$ host www.carlsjr.com
www.carlsjr.com is an alias for CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com.
CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com has address 126.96.36.199
CKEMKTPRDLB-20130419-1810707626.us-east-1.elb.amazonaws.com has address 188.8.131.52
I suspect Carl’s AWS-based web farm is generating the outgoing mails directly, and they have not accounted for that in their SPF configuration. Using Amazon Simple Email Service (SES) would have accounted for this already (http://docs.aws.amazon.com/ses/latest/DeveloperGuide/authenticate-domain.html). This could also be handled by designating an outgoing mail host in their AWS environment with an Elastic IP attached and adding it to their SPF record.
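To illustrate why the mail was rejected and why adding the sending host would fix it, here is a minimal sketch of SPF evaluation. It only handles the ip4/ip6 and all mechanisms (a real evaluator also resolves include:, a, mx and so on via DNS). The record and addresses in the usage note are made-up documentation values, not Carl’s Jr's actual SPF record.

```python
import ipaddress


def spf_result(record, sender_ip):
    """Evaluate only the ip4/ip6 and 'all' mechanisms of an SPF record."""
    ip = ipaddress.ip_address(sender_ip)
    qualifiers = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        qualifier = "+"  # a mechanism with no qualifier means pass
        if term[0] in qualifiers:
            qualifier, term = term[0], term[1:]
        if term == "all":
            return qualifiers[qualifier]
        if term.startswith(("ip4:", "ip6:")):
            network = ipaddress.ip_network(term[4:], strict=False)
            if ip.version == network.version and ip in network:
                return qualifiers[qualifier]
        # include:, a, mx, etc. need DNS lookups and are skipped in this sketch
    return "neutral"
```

With a hypothetical record of "v=spf1 ip4:192.0.2.0/24 -all", a message from 198.51.100.7 falls through to -all and fails, which is the same shape as the rejection above. Appending ip4:198.51.100.7 (standing in for the Elastic IP of the designated mail host) ahead of -all makes the same sender pass.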
I sent them an email detailing the issue at their corporate email address. I’ll update if I hear back, but I don’t expect it will ever reach anyone who knows what to do with it.
Update: I got an email back stating the issue was being routed ‘to the appropriate department’.
An odd issue I ran into the other day: I had a Varnish 3 instance that had logic hinging on req.backend.healthy to show a special error page if all the backends were down. That logic inexplicably triggered even though all my backends were up! After much head-scratching I identified the issue: one of my historical VCLs was still loaded and no longer had any healthy backends (due to repeated autoscaling up / down). Although the current definition of that director had healthy backends, the historical one did not. Varnish has a habit of not letting go of old VCLs even if you specify vcl.discard on them. So, req.backend.healthy will show the director as down if any prior definition of that director is down. Since the only way to definitively remove VCLs from memory is a restart (which flushes the memory cache), this makes req.backend.healthy fairly unreliable.
Two Factor Authentication (2FA) is an additional layer of protection you can add to your Minecraft server. You should already be relying on SSH keys to access your server, but those keys can be lost or leaked if you use them on untrusted machines. 2FA protects you from hacks resulting from someone gaining access to your password or SSH key.
DuoSecurity is a company that provides a 2FA service that is free for personal use. A Mobile Push notification is sent via their Android / iOS app whenever a login is attempted and requires an affirmative response before login can proceed. There are also SMS and computer-generated voice calls, but those consume credits that have a cost to refill. There is also a cost for going over 10 users, which should not be an issue for administration of a Minecraft server.
I performed a source install to get the latest version; there are also packages for RHEL/CentOS/Debian/Ubuntu on their website.
They offer an SSH-specific installation and a PAM installation that covers all auth on the machine. PAM is likely the more comprehensive solution but is a more involved process, plus it requires a SELinux policy update (and the server resulting from this series has SELinux enabled). Their website tells you how to install the relevant SELinux policy but the necessary objects are missing from their download. I’ve sent them a mail and will update if I get clarification. For the record, the error I got was:
make: Entering directory `/home/centos/duo_unix-1.9.14/pam_duo'
checkmodule -M -m -o authlogin_duo.mod authlogin_duo.te
checkmodule: loading policy configuration from authlogin_duo.te
checkmodule: unable to open authlogin_duo.te
make: [semodule] Error 1 (ignored)
semodule_package -o authlogin_duo.pp -m authlogin_duo.mod
semodule_package: Could not open file No such file or directory: authlogin_duo.mod
make: [semodule] Error 1 (ignored)
semodule -i authlogin_duo.pp
semodule: Failed on authlogin_duo.pp!
make: [semodule] Error 1 (ignored)
make: Leaving directory `/home/centos/duo_unix-1.9.14/pam_duo'
SSH setup is straightforward. Install the DuoSecurity app on your mobile device. Create an account on their website. Navigate their portal and select Applications, then +New Application. Give it a name of Minecraft SSH. This will result in a window showing an Integration key, a Secret key (requiring a click to show), and an API hostname. Collect those and store them to the side (securely!). Then download, build and install duo_unix:
tar zxf duo_unix-latest.tar.gz
./configure --prefix=/usr && make && sudo make install
After installing the binaries, modify /etc/duo/login_duo.conf and fill in the ikey/skey/host values from above. Uncomment pushinfo as well so that more details about the login are sent to you.
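Since a half-filled config is an easy way to lock yourself out, a quick sanity check before touching sshd may be worthwhile. The sketch below is my own helper, not part of duo_unix. It assumes the file keeps its INI layout with a [duo] section and only reports which of ikey/skey/host are missing or empty.

```python
import configparser

REQUIRED = ("ikey", "skey", "host")


def missing_duo_settings(conf_text):
    """Return the required [duo] keys that are absent or left empty."""
    # interpolation=None so a '%' in a value is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("duo"):
        return list(REQUIRED)
    return [
        key for key in REQUIRED
        if not parser.get("duo", key, fallback="").strip()
    ]
```

Reading /etc/duo/login_duo.conf (as a user allowed to read it) and passing the text to missing_duo_settings should return an empty list once all three values are filled in.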
https://www.duosecurity.com/docs/duounix#centos describes modifying SSH to work with the new system. Keep a second session, logged in as root, open in another window while attempting this so you can back out if you make a mistake. Do not log out of the ‘backup’ window until you’ve thoroughly exercised the system. I will not be responsible if you lock yourself out.
Add ForceCommand /usr/sbin/login_duo to the end of /etc/ssh/sshd_config, then restart sshd (which does not kick out existing sessions) with systemctl restart sshd.service
Now attempt to log in to your account. You will be prompted to enroll in 2FA. Upon logging in again you will be prompted for your 2FA method, which will look like:
$ ssh minecraft.example.com -i ~/.ssh/minecraft.pem -l centos
Duo two-factor login for centos
Enter a passcode or select one of the following options:
1. Duo Push to XXX-XXX-1234
2. Phone call to XXX-XXX-1234
3. SMS passcodes to XXX-XXX-1234
I always choose #1 as I do not want to expend credits, though if push failed I could resort to the other two. When I choose #1 I get a popup on my phone asking me to accept / deny the login, and after choosing Accept I am able to log in.
Now I do not have to worry that someone will get ahold of my minecraft.pem and gain access to the server. Further, if I get a login request on my phone and haven’t attempted to log in, I know my key has been compromised.
I had cause to call CenturyLink support recently. As one of the troubleshooting steps they seriously claimed that a power strip is not capable of providing sufficient power to a DSL modem, and that it must be connected directly to the wall to receive sufficient power.
I suspect this was to force me to unplug and powercycle the modem, but I had done that several times already. The idea that a DSL modem draws more power than will flow through a power strip is ridiculous.