
Saving AWS Costs While Your Developers Sleep: Project Ambien

By: DJ Spatoulas

Published: 6 Jun 2022
Last Updated: 12 Aug 2022

KnowBe4 Engineering relies heavily on On-Demand environments for quick iterations on cloud-native products and services. It is typical for some of our senior engineers to maintain multiple environments at once, to isolate the development and testing of different projects that are in progress in parallel. Unfortunately, the cost of the infrastructure resources needed to power these environments can add up quickly if it is not monitored and adjusted over time.

Although many of our engineering teams have adopted serverless architectures and patterns for new projects, we still maintain a variety of container-based systems and workloads that run in Fargate and need to be "On-Demandable" in order to properly fulfill the desired experience for our engineers at KnowBe4. Once we had a number of ECS and Fargate based services running in On-Demand environments, we identified a major opportunity to reduce infrastructure costs. If we could figure out a way to control the state (on/off) of the services for each On-Demand environment, we could significantly reduce our total infrastructure spend by preventing the services from running outside business hours when no one was going to use them.

Off-the-Shelf Solutions

Previously, we had used the EC2 Scheduler in another project to manage EC2 hosts, but it did not support ECS/Fargate infrastructure. After looking into the AWS Instance Scheduler, we noticed the addition of RDS support, but the new tool from AWS still did not support container-based workloads. There were no open source or AWS solutions available that would enable us to follow the best practice of only paying for containers we were using, so we needed to build something internally to solve the problem. We decided to use a lightweight tagging approach, similar to the ones featured in the AWS implementations mentioned above, due to the positive feedback from the community in the AWS support forums.

The Decision to Build Our Own

After digging into the solution overview and some examples of how the system works, the Platform Engineering team at KnowBe4 designed a system to start and stop the services running in an On-Demand environment based on a schedule defined by its owner during the creation of the environment. Ideally, the On-Demand service would default to only run the services during business hours unless explicitly changed by one of the environment's maintainers. We predicted that it would be helpful to allow our engineers to avoid having their environments turn off during sensitive times in their development workflows; however, we definitely underestimated the difficulty of "putting web services to sleep" and "waking them up" reliably and at scale.

Due to the nature of microservices architecture, an On-Demand environment is composed of one or more services running in the same environment. The larger, more complex environments could contain more than 10 services, which meant we needed to avoid adding operational overhead for the engineers managing those environments as much as possible. The goal of the Ambien project was to add a scheduling widget to the "Create an On-Demand" form that would allow users to set the scheduled hours of operation for their non-production environments, optimizing our infrastructure spend in a way that minimized the impact on our users.

The Platform Engineering team at KnowBe4 designed a system, code-named "Ambien", that would be responsible for reading the operational schedule of each active environment, and would start and stop the AWS resources in the environments based on the schedule declared by the owner of the environment. Our implementation uses two Lambda functions, which act as single-purpose microservices within the Ambien system.

Our Solution

Our design required the provisioning functionality of the On-Demand service to record the schedule details on each of the On-Demand infrastructure resources using a specific set of tags and a specific format for the values of the Ambien-related tags.

  • schedule - The start and end time of the On-Demand environment.
  • scheduleOverride - Do not turn off the On-Demand. Resume normal schedule after value has expired.

The schedule tag contains a start and end time for the component to be up, and is added to each of the infrastructure components that can be turned on and off in AWS.
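The post does not show the exact tag value format, so the following is only a minimal sketch of how a provisioning step might build and write the two tags. The "HH:MM-HH:MM" schedule format, the ISO 8601 override timestamp, and the helper names build_ambien_tags and apply_tags are assumptions made for illustration; the 3-day limit comes from the override behavior described later in this post, and tag_resources is the Resource Groups Tagging API call that accepts up to 20 ARNs per request.

```python
from datetime import datetime, timedelta, timezone

# Tag keys named in the post; the value formats below are assumptions.
SCHEDULE_TAG_KEY = "schedule"
OVERRIDE_TAG_KEY = "scheduleOverride"
MAX_OVERRIDE = timedelta(days=3)  # overrides of up to 3 days, per the post


def build_ambien_tags(start, end, override_until=None, now=None):
    """Return the tag dict for one resource (hypothetical value format)."""
    # Validate the HH:MM strings early so a bad schedule never reaches AWS.
    for value in (start, end):
        datetime.strptime(value, "%H:%M")
    tags = {SCHEDULE_TAG_KEY: f"{start}-{end}"}
    if override_until is not None:
        now = now or datetime.now(timezone.utc)
        if override_until - now > MAX_OVERRIDE:
            raise ValueError("scheduleOverride may be at most 3 days ahead")
        # Store the expiry in UTC so readers do not need the writer's time zone.
        tags[OVERRIDE_TAG_KEY] = override_until.astimezone(timezone.utc).isoformat()
    return tags


def apply_tags(tagging_client, resource_arns, tags):
    """Write the tags in batches of 20 ARNs, the per-call limit of tag_resources."""
    for i in range(0, len(resource_arns), 20):
        tagging_client.tag_resources(
            ResourceARNList=resource_arns[i:i + 20],
            Tags=tags,
        )
```

With real AWS credentials, tagging_client would be boto3.client("resourcegroupstaggingapi"); passing the client in as a parameter keeps the helpers easy to exercise with a stub.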

How it Works

When the Ambien service runs, it makes a GetResources request to the ResourceGroupsTaggingApi in order to retrieve a list of ecs:service and rds:cluster AWS resources with a schedule tag.

import boto3
import kb4_logging

from knowbe4.ambien.scheduler import (
    models,
    constants,
    utils,
)

logger = kb4_logging.get_logger(__name__)
tags_client = boto3.client("resourcegroupstaggingapi")

def get_aws_resources():
    logger.info(
        f"Gathering AWS Resources with an [{constants.SCHEDULE_TAG_KEY}] tag",
        fields={
            "SCHEDULE_TAG_KEY": constants.SCHEDULE_TAG_KEY,
        },
    )
    # Match any ECS service or RDS cluster that carries the schedule tag,
    # whatever the tag's value is.
    args = {
        "TagFilters": [{"Key": constants.SCHEDULE_TAG_KEY}],
        "ResourceTypeFilters": ["ecs:service", "rds:cluster"],
    }
    paginator = tags_client.get_paginator("get_resources")
    page_count = 0
    aws_resources = {"ecs": [], "rds": []}
    for page in paginator.paginate(**args):
        page_count += 1
        logger.debug(
            f"Parsing results from page [{page_count}]",
            fields={
                "page_count": page_count,
            },
        )
        for item in page["ResourceTagMappingList"]:
            aws_resource = models.AWSResource.parse(item["ResourceARN"], item["Tags"])
            if aws_resource.service == "rds":
                if aws_resource.resource_id.startswith("cluster-"):
                    logger.skip_warn(
                        f"The cluster [{aws_resource.resource_id}] is a private aws cluster "
                        f"and should not be used in ambien",
                        fields=vars(aws_resource),
                    )
                    continue
            aws_resources[aws_resource.service].append(aws_resource)

    logger.info(
        f"Finished gathering AWS Resources with an [{constants.SCHEDULE_TAG_KEY}] tag",
        fields={
            "SCHEDULE_TAG_KEY": constants.SCHEDULE_TAG_KEY,
        },
    )

    return aws_resources

The service then examines the schedule tag value of each resource returned in the response and determines whether the resource needs to be turned on or off. By comparing the current time to the scheduled hours of operation stored in the schedule tag, the service can deterministically produce the next state (on or off) of the resource and take action.
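The comparison described above can be sketched as follows. This is not the actual ambien-scheduler logic: the "HH:MM-HH:MM" tag format, the function names, and the idea of passing the time zone and an optional override expiry in as arguments are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta, timezone


def parse_schedule(tag_value):
    """Parse an assumed 'HH:MM-HH:MM' tag value into (start, end) times."""
    start_text, end_text = tag_value.split("-", 1)
    return time.fromisoformat(start_text), time.fromisoformat(end_text)


def desired_state(schedule_value, now, tz, override_until=None):
    """Return 'on' or 'off' for a timezone-aware datetime `now`."""
    # An unexpired override keeps the resource running regardless of schedule.
    if override_until is not None and now < override_until:
        return "on"
    start, end = parse_schedule(schedule_value)
    # Compare wall-clock time in the schedule's time zone.
    local = now.astimezone(tz).time()
    if start <= end:
        in_window = start <= local < end
    else:
        # The window wraps past midnight, for example 22:00-06:00.
        in_window = local >= start or local < end
    return "on" if in_window else "off"


def next_action(schedule_value, is_running, now, tz, override_until=None):
    """Return 'start', 'stop', or None when no state change is needed."""
    target = desired_state(schedule_value, now, tz, override_until)
    if target == "on" and not is_running:
        return "start"
    if target == "off" and is_running:
        return "stop"
    return None
```

Passing a zoneinfo.ZoneInfo such as ZoneInfo("America/New_York") as tz, rather than a fixed UTC offset, lets the wall-clock comparison follow daylight saving time changes, which is the kind of edge case that tends to surface only twice a year.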

If a state change is detected, the ambien-scheduler publishes one of the following events to Cerebro, the KnowBe4 Platform's Event Sourcing system:

import enum

class JobType(enum.Enum):
    pass


class ECSJobType(JobType):
    START_ECS_SERVICE = "start-ecs-service-scheduled"
    STOP_ECS_SERVICE = "stop-ecs-service-scheduled"


class RDSJobType(JobType):
    START_RDS_CLUSTER = "start-rds-cluster-scheduled"
    STOP_RDS_CLUSTER = "stop-rds-cluster-scheduled"

The ambien-worker service is subscribed to these event types in Cerebro using the subscription configuration shown below:

{
  "account": "AWS-Account-ID",
  "source": ["krn:production:us-east-1:ambien:scheduler"]
}
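Cerebro is an internal system and its matching rules are not described in this post, so the sketch below only illustrates one plausible reading of a filter like the one above: a list means "any of these values" and a scalar must match exactly. The matches helper, the example event, and its placeholder values are hypothetical; the event keys are limited to the ones the filter names and the ones the worker handler in this post reads.

```python
def matches(pattern, event):
    """Return True when every pattern key matches the event (assumed semantics)."""
    for key, expected in pattern.items():
        actual = event.get(key)
        # A list means "any of these values"; a scalar must match exactly.
        allowed = expected if isinstance(expected, list) else [expected]
        if actual not in allowed:
            return False
    return True


subscription = {
    "account": "AWS-Account-ID",
    "source": ["krn:production:us-east-1:ambien:scheduler"],
}

# Hypothetical event with placeholder values.
example_event = {
    "id": "example-event-id",
    "account": "AWS-Account-ID",
    "source": "krn:production:us-east-1:ambien:scheduler",
    "detail-type": "stop-ecs-service-scheduled",
    "detail": {
        "data": {
            "cluster_arn": "example-cluster-arn",
            "service_name": "example-service",
        }
    },
}

print(matches(subscription, example_event))
```

Because the filter only names the account and the source, every job type published by the scheduler would reach the worker under this reading, and the worker is left to dispatch on the event's detail-type.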

Once a message is received by the ambien-worker lambda function, it is able to process each scheduled event and then publish additional events based on the final outcome of the worker.

import kb4_logging

from ambien_common.client.cerebro import CerebroClient
from ambien_common.models import (
    ECSJobType,
    RDSJobType,
    ScheduledJobECS,
    ScheduledJobRDS,
)
from knowbe4.ambien.worker import exceptions
from knowbe4.ambien.worker.client import ecs, rds
from knowbe4.ambien.worker.models import (
    AmbienWorkerEvent,
    SuccessEventType,
    FailureEventType,
)
from knowbe4.ambien.worker import settings

logger = kb4_logging.get_logger(__name__)
config = settings.get_config()
cerebro_client = CerebroClient(
    url=config.cerebro_url,
    api_key=config.cerebro_api_key,
)


def start_ecs_service(event_data: dict) -> AmbienWorkerEvent:
    job_detail = ScheduledJobECS(**event_data)

    try:
        ecs.start_ecs_service(
            cluster_arn=job_detail.cluster_arn,
            service_name=job_detail.service_name,
            desired_count=job_detail.desired_count
        )
        return AmbienWorkerEvent(
            event_type=SuccessEventType.ECS_SERVICE_STARTED,
            job_detail=job_detail,
        )
    except exceptions.FailedToStartECSService as error:
        return AmbienWorkerEvent(
            event_type=FailureEventType.ECS_SERVICE_FAILED_TO_START,
            job_detail=job_detail,
            error=error,
        )


def stop_ecs_service(event_data: dict) -> AmbienWorkerEvent:
    job_detail = ScheduledJobECS(**event_data)

    try:
        ecs.stop_ecs_service(
            cluster_arn=job_detail.cluster_arn,
            service_name=job_detail.service_name,
        )
        return AmbienWorkerEvent(
            event_type=SuccessEventType.ECS_SERVICE_STOPPED,
            job_detail=job_detail,
        )
    except exceptions.FailedToStopECSService as error:
        return AmbienWorkerEvent(
            event_type=FailureEventType.ECS_SERVICE_FAILED_TO_STOP,
            job_detail=job_detail,
            error=error,
        )


def start_rds_cluster(event_data: dict) -> AmbienWorkerEvent:
    job_detail = ScheduledJobRDS(**event_data)

    try:
        rds.start_rds_cluster(cluster_id=job_detail.cluster_id)
        return AmbienWorkerEvent(
            event_type=SuccessEventType.RDS_CLUSTER_STARTED,
            job_detail=job_detail,
        )
    except exceptions.FailedToStartRDSCluster as error:
        return AmbienWorkerEvent(
            event_type=FailureEventType.RDS_CLUSTER_FAILED_TO_START,
            job_detail=job_detail,
            error=error,
        )


def stop_rds_cluster(event_data: dict) -> AmbienWorkerEvent:
    job_detail = ScheduledJobRDS(**event_data)

    try:
        rds.stop_rds_cluster(cluster_id=job_detail.cluster_id)
        return AmbienWorkerEvent(
            event_type=SuccessEventType.RDS_CLUSTER_STOPPED,
            job_detail=job_detail,
        )
    except exceptions.FailedToStopRDSCluster as error:
        return AmbienWorkerEvent(
            event_type=FailureEventType.RDS_CLUSTER_FAILED_TO_STOP,
            job_detail=job_detail,
            error=error,
        )

def handler(event: dict, _context):
    job_type = event["detail-type"]
    event_data = event["detail"]["data"]
    log_fields = {
        "event_id": event["id"],
        "job_type": job_type,
    }
    logger.info(
        "Processing Cerebro event",
        fields=log_fields,
    )

    # Dispatch on the event's detail-type; an unknown job type raises KeyError.
    job_result: AmbienWorkerEvent = {
        ECSJobType.START_ECS_SERVICE.value: start_ecs_service,
        ECSJobType.STOP_ECS_SERVICE.value: stop_ecs_service,
        RDSJobType.START_RDS_CLUSTER.value: start_rds_cluster,
        RDSJobType.STOP_RDS_CLUSTER.value: stop_rds_cluster,
    }[job_type](event_data)

    events = [job_result.to_cerebro_event(config.source_krn)]

    cerebro_client.publish_events(events)

    if job_result.is_failure():
        raise exceptions.AmbienWorkerException(job_result.to_error())
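The ecs and rds client modules imported by the worker are not shown in this post. As a rough sketch of what such helpers might do, the functions below assume that "stopping" an ECS service means scaling it to zero tasks and that "starting" it restores the desired count carried in the job detail. The function signatures are hypothetical, while update_service, stop_db_cluster, and start_db_cluster are existing boto3 operations. Translating AWS errors into the worker's own exception types is left out for brevity.

```python
def stop_ecs_service(ecs_client, cluster_arn, service_name):
    """Scale the service to zero tasks so no Fargate tasks keep running."""
    ecs_client.update_service(
        cluster=cluster_arn,
        service=service_name,
        desiredCount=0,
    )


def start_ecs_service(ecs_client, cluster_arn, service_name, desired_count):
    """Restore the task count the service had before it was put to sleep."""
    ecs_client.update_service(
        cluster=cluster_arn,
        service=service_name,
        desiredCount=desired_count,
    )


def stop_rds_cluster(rds_client, cluster_id):
    """Stop the cluster; the request fails if the cluster is not available."""
    rds_client.stop_db_cluster(DBClusterIdentifier=cluster_id)


def start_rds_cluster(rds_client, cluster_id):
    """Start a previously stopped cluster."""
    rds_client.start_db_cluster(DBClusterIdentifier=cluster_id)
```

In a Lambda function the clients would typically be created once at module load, for example boto3.client("ecs") and boto3.client("rds"), and reused across invocations.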

In order to test the new Ambien service against real-world environments, we started an internal beta within R&D using non-demo On-Demand environments. Soon after the feature was released, we added functionality so that an operator of the On-Demand service can prevent their environment from being turned off by adding a scheduleOverride value of up to 3 days in the future; this lets our engineers keep their environments running when working after hours or over the weekend. After a few more tweaks and the infamous DST-related round of bug fixes, the serverless implementation that our team designed proved to be a cheap, reliable and automated way to reduce infrastructure costs over time as more engineers join the team at KnowBe4!

The Outcome

The Ambien service is still running today, and KnowBe4 Engineering continues to leverage the service to reduce infrastructure spend on non-production resources. As the number of team members and environments continues to increase, so does the project's ROI! During the initial 6 months after the release, we noticed a cost reduction of around $10,000 per month in ECS spend. Since then, the number of On-Demand environments and running services has significantly increased over the past 2 years as KnowBe4 Engineering has grown its team and the number of products we are building. Based on a recent analysis of the current On-Demand environments, the estimated cost of these non-production environments would be substantially higher without a system like Ambien in place.

We're Hiring!

Interested in designing or contributing to solutions like Ambien? Are you passionate about purpose-driven, highly-productive software development of cloud native products and services? Send us an application! KnowBe4 Engineering is always looking for more talented engineers just like you! Check our open positions on our careers page - www.knowbe4.com/careers.
