We take pride here at HATech in being technology and cloud agnostic. We have great relationships with our vendors, but our customers' business goals are what ultimately define the solution. In the last week alone, we have had projects on IBM Bluemix, OpenStack, AWS, Google Compute Engine, native Docker and Joyent Triton. In addition, we have customers who choose to use AWS with CloudFlare, or AWS ECS with Microsoft Azure for authentication. The ability to deploy, automate and deliver across the right combination of tools is what 'cloud' is really about, and yet all too frequently we fall into the trap of thinking we should ONLY use one cloud provider for everything. The challenge, however, is that there are very few tools that work across the board or allow the creation of plugins to deliver the best solution for a customer's needs.

Like many, our road to AWS automation started with CloudFormation. While not perfect, it provides the mechanism to describe architecture and infrastructure as code, and templates can be created to describe nearly all AWS services and their configurations. The main problem, however, is that it doesn't cope well with infrastructure updates: there is always uncertainty about which services may be torn down and replaced rather than updated in place.

In the beginning - CloudFormation Nested Templates

Over the years we have used CloudFormation for AWS as well as for OpenStack through the Heat API. The golden rule when creating CloudFormation templates is to break the architecture down into multiple templates that can be easily developed, controlled and released. This model supports a high degree of parallel workflows, enabling many team members to craft a customer's solution without the risk of tripping over each other. CloudFormation supports nesting by having one template call additional child templates, facilitating the seamless passing of outputs and parameters between them.

A good example is our Jenkins template. For each new customer we create a Jenkins server in its own VPC. The template for the VPC is nested within the Jenkins template so that, when executed, the Jenkins template calls the VPC template, creates the VPC, subnets, security groups and so on, and then automatically passes all those values back to the parent Jenkins template. The parent template then uses these values to drive the Auto Scaling group, Launch Configuration and load balancers for the Jenkins instances, eventually arriving at a fully secured Jenkins server environment.
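
Conceptually, the parent template looks something like the hedged sketch below: it declares the child VPC stack and then consumes the child's outputs via Fn::GetAtt. The template URL, parameter names and output names (PublicSubnetId, ElbSecurityGroupId) are illustrative and assume the child template exports outputs with those names.

{
  "AWSTemplateFormatVersion" : "2010-09-09",
  "Resources" : {
    "VPCStack" : {
      "Type" : "AWS::CloudFormation::Stack",
      "Properties" : {
        "TemplateURL" : "https://s3.amazonaws.com/example-bucket/vpc.template",
        "Parameters" : { "VpcCidr" : "10.0.0.0/16" }
      }
    },
    "JenkinsELB" : {
      "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
      "Properties" : {
        "Subnets" : [ { "Fn::GetAtt" : [ "VPCStack", "Outputs.PublicSubnetId" ] } ],
        "SecurityGroups" : [ { "Fn::GetAtt" : [ "VPCStack", "Outputs.ElbSecurityGroupId" ] } ],
        "Listeners" : [ { "LoadBalancerPort" : "443", "InstancePort" : "8080", "Protocol" : "TCP" } ]
      }
    }
  }
}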

The problem, however, is what happens when I have an AWS solution containing multiple tools and services: SQS, SNS, RDS, Redshift, Data Pipeline, S3 and Lambda? Do I keep adding more child templates? How do I manage this, and how do I stop one failure impacting my whole application?

There are no tools to help manage nested templates, but more importantly, when a child template in a nested deployment fails, by default everything is torn down, because the intrinsic relationship between all the nested templates means they are treated as a single deployment. This is not acceptable in a real-world scenario, where a poorly updated template could result in an entire production environment being wiped out.

1st Maturity Step - Jenkins to drive Nested Templates

Our preferred solution is to abstract the nested templates and return to our software engineering and Jenkins pipeline methodology. While we agreed earlier in our development that CloudFormation was the right tool to create infrastructure as code for OpenStack and AWS, we have since come to understand that the mechanism simply isn't scalable across all AWS services. By revisiting the fundamentals of software design, testing and validation, we decided to move these templates to Jenkins, allowing us to test that each template has been deployed and verified before deploying the next. Driving the templates from Jenkins gives us the additional benefit of being able to delete any misbehaving component of the infrastructure, re-run the Jenkins job and let it rebuild the missing component.

In the example below we have a simple pipeline that deploys each component of a demo environment we use for training. Each of the jobs shown deploys a CloudFormation template that is driven by the outputs generated by the preceding template.

The high level process for the above pipeline goes like this:

    1) Create a shared workspace and git clone the version of the CloudFormation templates I want.
    2) Trigger the VPC build using the CloudFormation API. Watch and poll for a successful deployment. Collect all the CloudFormation outputs and make them available to downstream jobs (a sketch of this step follows the list).
    3) Trigger the HA NAT and Bastion instances in parallel, passing in the outputs generated when creating the VPC. These parallel jobs block the VPC job until they complete successfully.
    4) Run the necessary tests on the HA NAT and Bastion nodes and, if successful, declare the job complete.
    5) The VPC job, which has been patiently waiting for the downstream jobs to complete, gathers any outputs it needs from the HA NAT and Bastion jobs and then triggers the Elastic Beanstalk Variables job.
    6) The Elastic Beanstalk Variables job gathers all the outputs created in this short demo and makes them available in an S3 bucket, so our developers just have to pull the file down into .ebextensions in Elastic Beanstalk without having to know any of the VPC, subnet or routing information to deploy their applications.
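
As an illustration of step 2, the shell step behind the VPC job boils down to something like the sketch below. The stack name, template file and parameters are hypothetical, and the real jobs wrap this with more error handling.

#!/bin/bash
set -e
STACK_NAME="demo-vpc"

# Trigger the VPC build via the CloudFormation API
aws cloudformation create-stack \
  --stack-name "${STACK_NAME}" \
  --template-body file://vpc.template \
  --parameters ParameterKey=Environment,ParameterValue=demo

# Watch and poll until CloudFormation reports CREATE_COMPLETE (the job fails otherwise)
aws cloudformation wait stack-create-complete --stack-name "${STACK_NAME}"

# Flatten the stack outputs into KEY=VALUE pairs for downstream jobs to consume
aws cloudformation describe-stacks --stack-name "${STACK_NAME}" \
  --query 'Stacks[0].Outputs[].[OutputKey,OutputValue]' --output text \
  | awk '{print $1"="$2}' > vpc.properties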

While simple, this is a very powerful method for maintaining and managing an environment, not just deploying it. For example, if I wanted to replace the HA NAT job with a new job that uses the AWS NAT Gateway, I create the new job and set it as the downstream job instead, delete the HA NAT stack in CloudFormation and then simply rerun the Jenkins job. The last job collects all the newly generated variables and makes them available for our development team to use.

Where this gets immensely powerful is in having a shared folder of Jenkins jobs and templates that are used over and over again; simply by setting the jobs as downstream jobs you can create completely different infrastructures and combinations, and have a nicely visualized representation of what did and didn't deploy successfully. Couple this with the ability to hook Jenkins into Git and I can pull different versions of the jobs in at run time and have a fully version-controlled, visual pipeline of my infrastructure.

2nd Maturity Step - Embracing A Dynamic Environment

While the above is a massive leap forward for visualization, single-click deployments and empowering our developers, the CloudFormation templates themselves still need to be managed. OpenStack Heat has an edge here, as its API supports YAML as well as JSON, but at the time of writing CloudFormation only supports JSON. While JSON is a well documented and supported data format, it can get unwieldy very quickly. Good IDEs like IntelliJ from JetBrains with the CloudFormation plugin can help and provide a lot of syntax highlighting, but the challenge remains the same: how do I enable my developers to make changes quickly?

There are a few different schools of thought here. The first is to use CloudFormation for what is considered static and then use API calls or the console to add what you need. The second is to take the Mirantis Murano and Troposphere approach, which is to compile a CloudFormation template on the fly based on another 'recipe' or mechanism describing your desired end result. The third is to use a completely different tool.

We have used all three methods during our work with customers, and to be honest, choosing the right tool matters more for orchestrating the dynamic components of a project than for how the original static infrastructure was stood up. Within two days of starting a project with a customer we typically have all the CloudFormation templates and Jenkins jobs mapped out. However, the choice of tool or mechanism to handle the dynamic changes required throughout the life cycle and operations of an architecture will make or break the project's success. We are pulled into many projects in which the customer has automated their initial infrastructure using CloudFormation and then attempted to keep using CloudFormation for the many dynamic changes to their environment, which has invariably caused irreparable damage.

There are many tools that can help in this space; Chef, Puppet, Ansible, SaltStack and others all have some ability to drive a cloud in a dynamic way. We can even go down the route of using Python libraries like boto3 or the Go AWS SDK. However, there is always a need for a 'recipe' describing what you want, and then a 'record' of what actually got deployed. In addition, if the tool has a mechanism to show you what changes it is about to make before you accidentally say goodbye to your infrastructure, that is gold dust to those in an operational role, empowering the operations team to review and approve any production changes before pressing the button.

Introducing HashiCorp Terraform

We originally started down the route of building our own library and DSL to describe what we wanted and how we wanted it to be deployed. This was great until we were asked to work on a non-AWS project. CloudFormation doesn't work across all cloud environments, but more importantly, our value to our customers is that we can visualize, communicate and help execute towards their vision across any cloud environment. A single tool that can deliver all things to all people simply doesn't exist.

A year ago the landscape was a very different place. Tools were immature, and the usual suspects focused on orchestrating EC2 workloads rather than the more specialized services that only AWS offers. More importantly, vendors were focusing on AWS and a handful of other cloud providers, and even then the capabilities were pretty thin.

In the last year, HashiCorp have continued to embody what 'cloud' is really all about, and consequently have become a massive driving force behind what is required from a DevOps tool chain. For those that aren't familiar with HashiCorp, these are the guys behind Vagrant, Packer and Consul.

Terraform is their infrastructure-as-code tool-set. They officially support 23 providers including DigitalOcean, AWS, Azure, Google Compute Engine and VMware, with many other provider modules continually being contributed by the community. In addition, their DSL is simple, elegant and backwards compatible with JSON, and I can very simply create a recipe that automates multiple providers, pushing and pulling variables from one cloud provider into the next to create a fully integrated hybrid solution. If I want to deploy an application in AWS that authenticates against Azure, I can now do that with a common tool-set.
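
As a flavour of what that looks like, the hedged sketch below combines AWS and CloudFlare (one of the combinations mentioned earlier) in a single recipe; the CloudFlare record is driven directly by an output of the AWS resource. Credentials, domain and AMI id are placeholders, and the attribute names follow the provider versions current at the time of writing.

provider "aws" {
  region = "us-west-2"
}

provider "cloudflare" {
  email = "ops@example.com"              # placeholder account
  token = "CLOUDFLARE_API_TOKEN"         # placeholder credential
}

resource "aws_instance" "web" {
  ami           = "ami-12345678"         # placeholder AMI id
  instance_type = "t2.micro"
}

# The DNS record pulls its value from the AWS resource - variables flow
# from one provider into the next.
resource "cloudflare_record" "web" {
  domain = "example.com"                 # placeholder zone
  name   = "web"
  value  = "${aws_instance.web.public_ip}"
  type   = "A"
  ttl    = 300
}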

Most importantly, the mechanism to create, alter and update infrastructure no longer requires the creation of a large template file, and the language is consistent not just across all supported providers, but across the HashiCorp ecosystem.

Quick overview of Terraform

Terraform works by orchestrating and graphing the dependencies from a set of template files. Each file can contain whatever resource descriptions make sense for what you are building. You could create one single Terraform file that describes variables, resources and outputs, or, if you're looking for a high degree of pattern reuse, you could create multiple files describing individual resources that can be used over and over again.
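
A minimal single-file sketch, with all names and the AMI id purely illustrative, might contain a variable, a resource and an output side by side:

# main.tf - illustrative single-file example
variable "environment" {
  default = "demo"
}

resource "aws_instance" "jenkins" {
  ami           = "ami-12345678"              # placeholder AMI id
  instance_type = "t2.medium"

  tags {
    Name = "jenkins-${var.environment}"
  }
}

output "jenkins_private_ip" {
  value = "${aws_instance.jenkins.private_ip}"
}

Splitting those same three blocks across three separate files changes nothing about how Terraform evaluates them, which is what makes the reuse model so natural.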

When creating the Terraform 'plan', all files in the directory are read in and a graph of what is going to be deployed is created. This means there is no need to link or import your files; simply having files in the directory ensures they are included in the overall graph.

Our Jenkins CloudFormation deployment in Terraform now looks like this:
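
The directory is broken down broadly along these lines (file names are illustrative):

vpc.tf
subnets.tf
security_groups.tf
elb.tf
autoscaling.tf
variables.tf
outputs.tf
userdata.tpl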

Each one of these files is about 10 lines long, and each simply contains the description of the resources it is responsible for deploying. userdata.tpl is not a Terraform file as such but a template file that describes the user data needed to deploy Jenkins, which is injected into the EC2 instance when it starts. Being a template, the same userdata file can be used in almost ANY cloud environment with little or no change.

One of the most problematic areas of CloudFormation is the userdata resource and the syntax it uses. Everything must be escaped or converted to base64. This makes it particularly tricky when orchestrating Microsoft Windows using PowerShell-based user data, as the escaping can get more than a little out of hand. Using a template file lets us define what we want to execute when the instance boots and then render a working version of the userdata by injecting variables into the template before it is included. These variables could be anything that needs to be dynamically created, or even learned from other AWS resources deployed by the other templates.
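
For comparison, even the trivial two-line script shown in the next example ends up wrapped in Fn::Base64 and Fn::Join when embedded directly in CloudFormation JSON, with every newline and reference escaped by hand (the ClusterName parameter here is hypothetical):

"UserData" : {
  "Fn::Base64" : {
    "Fn::Join" : [ "", [
      "#!/bin/bash\n",
      "echo ECS_CLUSTER=", { "Ref" : "ClusterName" }, " > /etc/ecs/ecs.config\n",
      "echo ECS_ENGINE_AUTH_TYPE=dockercfg >> /etc/ecs/ecs.config\n"
    ] ]
  }
}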

A simple use case we use all the time is to enable our AWS EC2 instances to be automatically added to an AWS ECS cluster so that we can deploy Docker containers using the ECS tool set. The snippet below is the userdata.tpl that sets the name of the cluster on start:

#!/bin/bash
echo ECS_CLUSTER=tf-${stackname}-${zone}-ecs > /etc/ecs/ecs.config
echo ECS_ENGINE_AUTH_TYPE=dockercfg >> /etc/ecs/ecs.config

We replace the two tokens ${stackname} and ${zone} when the template is rendered, allowing this user data to be used over and over again in different projects.
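
In Terraform that rendering step is a handful of lines. The sketch below uses the template_file data source (a resource in older Terraform releases); the stack name, zone, AMI id and resource names are illustrative.

# Render userdata.tpl, filling in the ${stackname} and ${zone} tokens
data "template_file" "ecs_userdata" {
  template = "${file("userdata.tpl")}"

  vars {
    stackname = "demo"
    zone      = "us-west-2a"
  }
}

# Hand the rendered user data to the instances that will join the ECS cluster
resource "aws_launch_configuration" "ecs" {
  image_id      = "ami-12345678"          # placeholder ECS-optimized AMI
  instance_type = "t2.medium"
  user_data     = "${data.template_file.ecs_userdata.rendered}"
}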

3rd Maturity Step - Empower Developers with Terraform

Returning to a key use case, we want to be able to handle dynamic components in a reliable and repeatable way, but we don't really want our developers to have to learn a new DSL unnecessarily, nor do we want to edit files unnecessarily. Everyone knows you risk breaking something whenever you edit something that already works.

Terraform provides a simple yet elegant solution for this. Simply including new files in the directory adds new resources to the deployment. This creates immense flexibility: I can happily let my developers create files describing what they need and have them included, rather than editing something that is supporting a production deployment.

Using the S3 example, I want to enable my developers to create, update and remove S3 buckets at will. This needs to be repeatable, but I also want to keep the 'released' and stable templates controlled. Terraform gives me a simple mechanism to create a modularized deployment of stable templates that mimic production, while still empowering my developers and customers to create AWS resources in an automated fashion just by adding a new file.
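
As a sketch of what that looks like in practice, a developer drops a small file like the one below (file name, bucket name and output are hypothetical) into the shared directory, and the next plan and apply pick it up without touching any of the released templates:

# s3-app-assets.tf - hypothetical file added by a developer
resource "aws_s3_bucket" "app_assets" {
  bucket = "tf-demo-dev-app-assets"       # placeholder bucket name
  acl    = "private"
}

output "app_assets_bucket" {
  value = "${aws_s3_bucket.app_assets.id}"
}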

From our point of view, Terraform strikes the perfect balance: it protects what has been approved, stabilized and released, while empowering our developers to add and release new functionality in a way that doesn't risk impacting what has been deployed or marked as stable and running in production.