OnDemand Series Part 2: Orchestrating Development Environments


Pay no attention to the man behind the curtain! In case you missed it, check out part 1 here.

In the first part of the series, we looked at how KnowBe4 addresses common issues with test environments by running "On-Demand" copies of our services whenever an engineer has some testing to do. In this part, we'll go over the behind-the-scenes magic that powers these On-Demands, ultimately aiming to provide a seamless experience for our engineers.

Infrastructure as Code

In cloud-native development, building systems in an easily repeatable manner is almost a prerequisite for successful deployments. Engineers use various Infrastructure as Code (IaC) tools to achieve this, and for the purpose of this guide we will focus on Terraform.

Whether you're using Terraform, CloudFormation, AWS CDK, or something completely different, the strategy described here should be very similar. Using variables in our IaC configurations, we are able to create production-identical resources in a development environment, leaving fewer opportunities for mistakes. We can also iterate on this IaC in these non-production environments in order to reliably test infrastructure changes before they are deployed.

For the purpose of this guide, we are going to describe a simple two-tier web app using a relational database and an EC2 instance. This is a very simplified version of a service, but the principles here are portable into much more complex infrastructure designs.

Naming Resources with Variables

The core tenet that makes On-Demands possible is including variables in the name of resources. In AWS many resources must have unique names. Because of this, each On-Demand is prefixed with a unique name beginning with od-, and this unique name is included in named resources.
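As a minimal sketch of this idea, the resource below embeds the workspace name in its own name. The security group and its "-web" suffix are hypothetical and only illustrate the naming pattern; they are not part of KnowBe4's actual configuration.

```hcl
# Hypothetical example: the resource and the "-web" suffix are assumptions.
resource "aws_security_group" "web" {
  # terraform.workspace resolves to the current workspace, which for an
  # On-Demand is its unique "od-..." name, so names never collide.
  name        = "${terraform.workspace}-web"
  description = "Web traffic for ${terraform.workspace}"

  tags = {
    Name = "${terraform.workspace}-web"
  }
}
```

Because the name is derived from the workspace, the same configuration can be applied once per On-Demand without any manual renaming.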

Preparing Your Terraform

When deploying resources via Terraform, there are a few prerequisites that you probably want to set up, including remote state (perhaps in S3) as well as having the AWS CLI installed and configured. We've left those steps out of this guide, but you can read more about that here and here.
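For orientation only, a provider and S3 remote state setup might look roughly like the sketch below. The bucket name, state key, and region are placeholders that we made up for illustration; substitute your own values.

```hcl
# Hypothetical values throughout: bucket, key, and region are placeholders.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }

  # Backend blocks cannot reference variables, so these values are literals.
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "rds/terraform.tfstate"
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}
```

With the S3 backend, each Terraform workspace gets its own state object in the same bucket, which is what lets one configuration serve many On-Demands.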

Setting Up Your Database

KnowBe4's strategy for On-Demands employs pre-created resources, both for cost savings and deployment speed. One major component of this is a relational database, which is expensive to run and can take many minutes to provision. Note that because this database is shared, it is not named after any individual On-Demand; it only uses the name of the shared workspace.

main.tf:

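# from https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/db_instance#basic-usage
resource "aws_db_instance" "this" {
  # Database name, taken from the shared workspace (e.g. "ondemand").
  # Recent AWS provider versions use db_name; older versions called this argument name.
  db_name = terraform.workspace

  allocated_storage    = 10
  engine               = "mysql"
  engine_version       = "5.7"
  instance_class       = "db.t3.micro"
  username             = "kmitnick"
  # Example only: do not hardcode real credentials in your configuration.
  password             = "KnowBe4TrainingIsTheBest!"
  parameter_group_name = "default.mysql5.7"
  skip_final_snapshot  = true
}

# Expose the instance to downstream workspaces via remote state.
output "rds" {
  value = aws_db_instance.this
  # The resource includes the password, so the output must be marked sensitive.
  sensitive = true
}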

Then we need to deploy it, assuming we have our AWS provider and state defined in some other file:

$ terraform init
$ terraform workspace new ondemand
$ terraform apply -auto-approve

You should now have an RDS instance up and running!

Adding our EC2 Instance

Moving on to our application! The resources below again assume that you are using some sort of Terraform remote state storage, and that you have created a userdata.sh script to configure your application.

We'll start by defining our variables and locals:

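variable "workspace_override" {
  type    = string
  default = null

  validation {
    # Allow null (no override); otherwise require lowercase alphanumerics and hyphens.
    # The override names the shared upstream workspace (e.g. "ondemand"), not the On-Demand itself.
    condition     = var.workspace_override == null || can(regex("^[a-z0-9-]+$", var.workspace_override))
    error_message = "The workspace_override value must only include lowercase alphanumeric characters or hyphens."
  }
}

locals {
  # local.workspace is the shared workspace name (e.g. "ondemand") when an
  # override is set, otherwise the actual workspace name
  workspace = var.workspace_override != null ? var.workspace_override : terraform.workspace
}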

Now we can add our remote state:

data "terraform_remote_state" "rds" {
  ...
  # Pull remote state for our shared RDS instance via local.workspace
  workspace = local.workspace
}
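The elided settings above depend on where your state lives. Purely as an illustration, and assuming an S3 backend with a made-up bucket, key, and region, a complete data source could look like the following. It is given a different name (rds_example) so it is not mistaken for the configuration used in this guide, and it relies on local.workspace as defined earlier.

```hcl
# Hypothetical sketch: backend type, bucket, key, and region are assumptions.
data "terraform_remote_state" "rds_example" {
  backend = "s3"

  # Selects which workspace's state to read: the shared "ondemand" workspace
  # when workspace_override is set, otherwise the current workspace.
  workspace = local.workspace

  config = {
    bucket = "example-terraform-state"
    key    = "rds/terraform.tfstate"
    region = "us-east-1"
  }
}
```

The config values must match the backend settings of the configuration that created the database, since that is where its outputs are stored.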

Finally, we'll create our EC2 instance:

data "aws_ami" "ubuntu" {
  ...
}

# from https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/instance#basic-example-using-ami-lookup
resource "aws_instance" "this" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  # Assume we have some userdata.sh template file that is expecting ${db_addr} to be specified
  user_data = templatefile("${path.module}/userdata.sh", { db_addr = data.terraform_remote_state.rds.outputs.rds.address })

  tags = {
    # Name will include the name of the OnDemand
    Name = "knowbe4-${terraform.workspace}"
  }
}

As above, we need to deploy it, but this time we will create an On-Demand workspace and point workspace_override at the shared ondemand workspace:

$ export OD_NAME=od-your-knowbe4-training-is-overdue
$ export WORKSPACE_OVERRIDE=ondemand
$ terraform init
$ terraform workspace new $OD_NAME
$ terraform apply -var workspace_override=$WORKSPACE_OVERRIDE -auto-approve

Applying These Principles to Other Resources

By separating shared resources into upstream remote states, these same principles can be applied to any resource in AWS. In a development environment, tons of different things can be shared to save costs and avoid expensive infrastructure overhead, including databases, caches, S3 buckets, KMS keys, and more.
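As a rough sketch of what sharing another resource could look like, the upstream (shared) configuration below creates a KMS key and publishes its ARN as an output, in the same way the RDS example publishes the database. The resource name, description, and deletion window are illustrative assumptions.

```hcl
# Hypothetical upstream resource, applied once in the shared workspace.
resource "aws_kms_key" "shared" {
  description             = "Shared development key for ${terraform.workspace}"
  deletion_window_in_days = 7
}

# Downstream On-Demand workspaces can read this value through remote state.
output "kms_key_arn" {
  value = aws_kms_key.shared.arn
}
```

Each On-Demand then reads kms_key_arn from the shared workspace's state instead of creating a key of its own.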

Just like the Name tag and remote state workspace in the EC2 instance above, it is important to ensure you are using terraform.workspace vs local.workspace correctly: terraform.workspace for anything that must be unique to an On-Demand, and local.workspace for looking up shared upstream resources. By defining the local the way that we have here, we ensure the IaC is repeatable for both On-Demand and production environments.
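To make the distinction concrete, the hypothetical output below prints both values side by side. It assumes the workspace_override variable and local.workspace defined earlier in this guide; the output name is ours.

```hcl
# Hypothetical debugging aid; relies on local.workspace from the locals block above.
output "workspace_names" {
  value = {
    # The On-Demand's own "od-..." name, or the real workspace name
    # when no override is in play.
    per_environment = terraform.workspace

    # The shared workspace (e.g. "ondemand") when workspace_override is set,
    # otherwise identical to terraform.workspace.
    shared_upstream = local.workspace
  }
}
```

When no override is passed, both values are the same, which is why the same code works unchanged outside of On-Demands.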

Don't Forget About Limits

Ah, the elephant in the room! AWS limits are reached much more quickly when you're running hundreds of On-Demands. Just like you would do with a production workload, it is important to plan ahead and foresee potential problem areas before they happen.

Some common limits that we have hit over the years include S3 buckets, IAM policies, and even security group rules. Using more advanced features of Terraform like count and for_each allows you to create multiple copies of certain resources with identical configurations.
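As one possible illustration of for_each, the sketch below creates a small, fixed set of buckets from a single resource block. The bucket purposes and naming scheme are invented for this example.

```hcl
# Hypothetical example: the purposes and the naming scheme are assumptions.
locals {
  shared_bucket_purposes = toset(["uploads", "reports"])
}

# One resource block, one bucket per element of the set.
resource "aws_s3_bucket" "shared" {
  for_each = local.shared_bucket_purposes

  bucket = "knowbe4-${terraform.workspace}-${each.key}"
}

# Map of purpose => bucket name, for use by downstream workspaces.
output "shared_bucket_names" {
  value = { for purpose, b in aws_s3_bucket.shared : purpose => b.bucket }
}
```

Adding another bucket then only requires adding an entry to the set, and the number of buckets stays tied to the number of purposes rather than the number of On-Demands.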

Creating these resources with easy-to-hit limits in a shared upstream workspace (along with your other shared resources like the RDS instance in this example) allows you to share a single resource with many or all On-Demands. Things like S3 buckets and KMS keys are easy and low-risk items to share within a development environment. Others, like IAM policies, require additional strategy and planning to make sure you get it right the first time!

Want to Learn More?

If you're interested in learning more about our On-Demand strategy, stay tuned for additional parts in this series.

If you're passionate about solving these types of problems, check out our current open positions and send us your resume!

Topics: Engineering, Software Development
