Optimizing High-Performance EC2 Costs with AWS Lambda

Feb 27, 2024

The Backstory

In the world of computing, efficiency is king. As generous as this king may be, he commands a hefty price. As businesses grow, infrastructure costs are often the first to balloon right alongside them. I stumbled upon this very issue in my day-to-day operations, where two crucial factors came into play:

  • Processing time
  • More (or rather: less) processing time

Given the GPU-intensive nature of our tasks, we had no choice but to call in the heavy artillery: multi-GPU instances. You don't have to take my word for it that these units come with a steep price tag; just take a peek at the AWS EC2 pricing for g4dn.12xlarge. At almost 4 bucks an hour, this machine isn't even the priciest in the lineup. But even so, running it round-the-clock racks up significant expenses: at roughly $3.9 per hour, a full month (about 730 hours) comes to around $2,850. It was clear I had to take action…

The Challenge

Let's start with what we need. Processing time is key, right? We needed to get things moving ASAP. Our setup used SQS for queuing up tasks. I sketched out how everything was connected and pinpointed a few steps:

  1. Spotting SQS messages that need to be dealt with
  2. Getting the EC2 instance up and running
  3. Figuring out when SQS is all clear of messages
  4. Sometimes turning off the EC2, but not always

Why not always? Because the system kicks off by itself when the EC2 instance starts (thanks to a systemd service), and we need to keep a door open for fixing and tweaking things. With all this in mind, I had a pretty good idea of what needed to be done.

The Strategy

At this point it was pretty clear to me that what I needed was an EventBridge-triggered AWS Lambda setup, plus a little custom scripting on the side. The tricky part was figuring out a simple way to decide whether the EC2 instance should shut down after finishing its job. I wanted to avoid keeping track of state outside the system, because that just makes everything more complicated. After a bit of digging and GPT-ing, I realized the best approach was to use the EC2 instance's own tags, which are part of its metadata. Pretty straightforward, right? Here's a quick rundown.
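
To give a taste before the full setup: checking such a tag boils down to a single API call. Here's a minimal boto3 sketch (the helper name is mine; the "AutoStop" tag is the one the real code uses later):

import boto3

ec2 = boto3.client('ec2')

def auto_stop_enabled(instance_id):
    # Fetch the instance's tags and check the "AutoStop" flag
    response = ec2.describe_tags(
        Filters=[{'Name': 'resource-id', 'Values': [instance_id]}]
    )
    tags = {tag['Key']: tag['Value'] for tag in response['Tags']}
    return tags.get('AutoStop', 'false').lower() == 'true'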

The Solution

AWS Lambda

Let's dive into AWS Lambda. I like to keep things simple, so I chose Python with the Boto3 library (AWS's software development kit for Python) because it's straightforward and quick to get going with:

import os

import boto3

ec2 = boto3.resource('ec2')
sqs = boto3.client('sqs', region_name=os.environ.get('AWS_DEFAULT_REGION'))

# The full queue URL, e.g. https://sqs.<region>.amazonaws.com/<account-id>/processing_queue
queue_url = os.environ.get('QUEUE_URL')

# Instance ID of the processing server, injected via the Lambda's environment
PROCESSING_SERVER_ID = os.environ.get('PROCESSING_SERVER_ID')
PROCESSING_SERVER = ec2.Instance(PROCESSING_SERVER_ID)


def lambda_handler(event, context):
    # Refresh the cached instance attributes so we see the current state
    PROCESSING_SERVER.reload()

    if not ec2_is_stopped():
        print("EC2 is not in a startable state, do nothing")
        return

    if jobs_waiting() > 0:
        start_ec2_instance()


def ec2_is_stopped():
    state = PROCESSING_SERVER.state['Name']
    print(f"EC2 state: {state}")
    return state == 'stopped'


def jobs_waiting():
    # ApproximateNumberOfMessages is, as the name suggests, approximate -
    # but good enough to decide whether the instance should wake up
    try:
        response = sqs.get_queue_attributes(
            QueueUrl=queue_url,
            AttributeNames=['ApproximateNumberOfMessages']
        )
        return int(response['Attributes']['ApproximateNumberOfMessages'])

    except Exception as e:
        print(f"Error: {e}")
        return 0


def start_ec2_instance():
    print("Starting EC2 instance")
    print(PROCESSING_SERVER.start())

This method is pretty direct. However, I highly recommend adding some monitoring to catch any unexpected issues.
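
As for what that monitoring could look like: one lightweight option is to have the handler push a custom CloudWatch metric whenever it acts, and alarm on it later. A minimal sketch, with a made-up namespace and metric names:

import boto3

cloudwatch = boto3.client('cloudwatch')

def report_instance_start(jobs_count):
    # Hypothetical custom metrics recording each wake-up and the queue depth behind it
    cloudwatch.put_metric_data(
        Namespace='ProcessingPipeline',
        MetricData=[
            {'MetricName': 'InstanceStartTriggered', 'Value': 1, 'Unit': 'Count'},
            {'MetricName': 'JobsWaiting', 'Value': jobs_count, 'Unit': 'Count'},
        ]
    )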

For a complete setup, we also need to schedule when this Lambda function runs. I suggest using EventBridge to trigger the Lambda. Here’s how you can set it up using Terraform:

resource "aws_cloudwatch_event_rule" "this" {
  name = "OneLambdaToRuleThemAll"
  description = "A once-per-minute trigger for the Lambda controlling the EC2 instance"
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "this" {
  arn   = aws_lambda_function.this.arn // that should be created separately
  rule  = aws_cloudwatch_event_rule.this.name
}

And remember, this Lambda function needs the right IAM permissions to work properly (EventBridge also needs permission to invoke the function, via an aws_lambda_permission resource, which I'm leaving out here):

data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    actions = [
      "sts:AssumeRole"
    ]

    principals {
      type        = "Service"
      identifiers = [
        "lambda.amazonaws.com"
      ]
    }
  }
}

data "aws_iam_policy" "AWSLambdaBasicExecutionRole" {
  name = "AWSLambdaBasicExecutionRole"
}


resource "aws_iam_policy" "this" {
  name   = "trigger_lambda_policy"
  policy = jsonencode(
    {
      "Version"   = "2012-10-17",
      "Statement" = [
        {
          "Effect"   = "Allow",
          "Action" = [
            "ec2:StartInstances",
          ],
          "Resource" = [
            "arn:aws:ec2:${var.region}:${var.aws_user_id}:instance/${var.processing_server_instance_id}"
          ]
        },
        {
          "Effect"   = "Allow",
          "Action" = [
            "ec2:DescribeInstances",
          ],
          "Resource" = [
            "*"
          ]
        },
        {
          "Effect"   = "Allow",
          "Action" = [
            "sqs:SendMessage",
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
            "sqs:GetQueueAttributes"
          ],
          "Resource" = var.sqs_arn
        }
      ]
    }
  )
}


resource "aws_iam_role" "this" {
  name                  = "trigger_lambda_role"
  assume_role_policy    = data.aws_iam_policy_document.lambda_assume_role.json
  force_detach_policies = true
  managed_policy_arns   = [
    data.aws_iam_policy.AWSLambdaBasicExecutionRole.arn,
    aws_iam_policy.this.arn
  ]
}

This isn't a deep dive into Terraform, so I'll leave it at that. With this setup, we've got a simple, efficient, and cost-effective way to trigger our Lambda function.

EC2

Before we jump in, let me set the record straight: NodeJS isn't usually my go-to. Coming from a Java/Kotlin background, I find that NodeJS's relaxed approach to typing isn't really my cup of tea. We could get into a debate about whether its flexibility with types is a pro or a con, but that's a topic for another day ;)

I ended up using NodeJS simply because it was already part of the setup. And honestly, it gets the job done. Here’s the game plan:

  1. Continuously check for SQS messages (you can get fancy with consumers if you like)
  2. If there’s no message, bump up an internal counter. We’re not giving up after just one try, right?
  3. If there’s still no sign of messages after a while, it’s time to trigger a self-shutdown.

Now, let’s dive into the details:

import * as aws from "aws-sdk";

// Stand-in for the application's logger; swap in your own
const log = console;

// How many empty polls we tolerate before shutting down (value is arbitrary, tune it)
const MAX_TRIES_FOR_MESSAGE = 10;
let TRY_MESSAGE_COUNT = 0;

const handleSQSPacket = (err, data) => {
    if (err) {
        log.error(err);
        return;
    }
    if (!Array.isArray(data.Messages)) {
        // Empty poll: count it, and shut ourselves down once we run out of patience
        TRY_MESSAGE_COUNT++;
        if (TRY_MESSAGE_COUNT > MAX_TRIES_FOR_MESSAGE) {
            shutdownEC2Self();
        }
        return;
    }
    TRY_MESSAGE_COUNT = 0;
    data.Messages.forEach((msg) => {
        handleMessage(msg); // application-specific processing, not shown here
    });
};

const shutdownEC2Self = () => {
    // Escape hatch for debugging: set NO_STOP and the instance stays up
    if (process.env.NO_STOP) {
        return;
    }
    // Ask the instance metadata service for our own instance ID
    new aws.MetadataService().request(
        "/latest/meta-data/instance-id",
        (error, data) => {
            if (error) {
                log.error(JSON.stringify(error));
                return;
            }
            isAutoStopEnabled(data)
                .then((enabled) => {
                    if (!enabled) {
                        return;
                    }
                    log.info(`Shutting down EC2 instance {instanceId=${data}}`);
                    const ec2 = new aws.EC2();
                    ec2.stopInstances({ InstanceIds: [data] }, (err, data) => {
                        if (err) {
                            log.error(JSON.stringify(err));
                        } else {
                            log.info(JSON.stringify(data));
                        }
                    });
                })
                .catch((error) => {
                    log.error(JSON.stringify(error));
                });
        },
    );
};

const isAutoStopEnabled = async (instanceId: string): Promise<boolean | null> => {
    const ec2 = new aws.EC2();
    const params = {
        Filters: [
            {
                Name: "resource-id",
                Values: [instanceId],
            },
        ],
    };

    try {
        const response = await ec2.describeTags(params).promise();
        const autoStopEnabledTag = (response.Tags || []).find(
            (tag) => tag.Key === "AutoStop",
        );
        return autoStopEnabledTag
            ? autoStopEnabledTag.Value.toLowerCase() === "true"
            : null;
    } catch (error) {
        log.error("Error retrieving tags:", error);
        return null;
    }
};

// The polling loop itself, kept deliberately simple; QUEUE_URL is assumed to be set
const sqs = new aws.SQS();
setInterval(() => {
    sqs.receiveMessage(
        { QueueUrl: process.env.QUEUE_URL!, WaitTimeSeconds: 20 },
        handleSQSPacket,
    );
}, 30000);

The method here is really straightforward again. I've set up a tag named "AutoStop" on the EC2 instance, which you can easily adjust whenever you need to: just head over to the EC2 console, select your instance, and go to Actions -> Instance Settings -> Manage Tags. One caveat: the instance's IAM role needs the ec2:DescribeTags and ec2:StopInstances permissions for this self-shutdown to work.
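
If clicking through the console gets old, the same toggle is one API call away. Here's a minimal boto3 sketch (the helper name and example instance ID are mine):

import boto3

ec2 = boto3.client('ec2')

def set_auto_stop(instance_id, enabled):
    # Flip the "AutoStop" tag that the processing server checks before stopping itself
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{'Key': 'AutoStop', 'Value': 'true' if enabled else 'false'}]
    )

# For example, keep the instance alive while debugging:
# set_auto_stop('i-0123456789abcdef0', False)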

Conclusion

That’s a wrap! We’ve explored some handy tricks to make EC2 instances more cost-effective with AWS Lambda. From using simple tags to setting up automated triggers, it’s all about making things run smoother without breaking the bank. Hope these tips help you out. Catch you later!

And as always, thanks for sticking with me to the end.

Jed