How to Process Data with Terraform and Lambda

Manav Kohli
Published in Thanx
Feb 14, 2019 · 6 min read


At Thanx, we ingest and process lots of data. Over the past seven years our platform has grown to support an increasing number of merchants and customers, which in turn has placed a greater strain on our pipelines. Manually moving and sanitizing that data can be both tiresome and costly, so we decided to automate some of that work. Given the size of our small-but-growing engineering team, it’s imperative that we have a process that enables each of us to contribute to infrastructure changes in a consistent and verifiable way.

Terraform offers a lightweight solution to version control infrastructure changes and supports multiple cloud service providers. Normally, processing data in S3 buckets would require us to set up lambda functions, SNS topics, event notifications, and IAM roles. Since we’d much rather spend that time improving and expanding our platform, we decided to invest in a solution that enabled deploying infrastructure as code. Because we know we aren’t the only team facing these challenges, we also thought it was important to share some of our learnings with you.

Terraform at a Glance

Terraform enables developers to manage infrastructure resources and reference existing ones with code. By breaking deployment into discrete stages, it also makes it easy to understand exactly what will change before anything is applied.

Please note that this tutorial assumes basic knowledge of Terraform syntax and familiarity with AWS services, as processing data in S3 buckets requires setting up lambda functions, SNS topics, event notifications, and IAM roles. Before continuing, I highly recommend reading through the excellent introduction on the Terraform website (if you’re installing it for the first time, it’s available via Homebrew).

Terraform’s process for deploying infrastructure relies most heavily on the following six commands:

  • init: initializes your Terraform configuration. Non-destructive, required.
  • import: enables existing resources to be managed by Terraform. For example, if you import an EC2 instance, its attributes are recorded in the state (you still need a matching resource block in the configuration). However, if you later decide against managing it with Terraform, removing it from the configuration and applying will destroy that instance. Destructive, optional.
  • validate: linting. It will look through your files and verify that resources are correctly defined and referenced. Non-destructive, optional.
  • plan: builds an execution plan and demonstrates which changes will be run on apply. Non-destructive, optional.
  • apply: executes the plan and deploys the changes. Destructive, required.
  • destroy: removes all resources managed in the configuration. Destructive, optional.

As you can see, the only actions required for deploying infrastructure using Terraform are initializing a configuration and applying the changes. A typical workflow may involve running validate and plan during your development and validation process, and only applying the execution plan in a deploy process.

Infrastructure objects in Terraform are either managed through the configuration (resources) or act as placeholders (data sources). A data source can reference existing infrastructure or serve as a temporary, read-only value used by other resources. For example, the following declaration will create a new EC2 instance when you run the apply command.
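
Here’s a minimal sketch; the resource name, AMI ID, and instance type are placeholders:

# Declares an EC2 instance that Terraform will create and manage.
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"
}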

You can verify this by running the plan command, which produces the following output:
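
Abridged output for the declaration above (exact formatting varies by Terraform version):

Terraform will perform the following actions:

  + aws_instance.example
      ami:           "ami-0123456789abcdef0"
      instance_type: "t2.micro"
      ...

Plan: 1 to add, 0 to change, 0 to destroy.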

Alternatively, declaring something as a data source only requires enough information to uniquely identify it.
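
For example, an existing EC2 instance can be looked up by its ID (the name and ID here are placeholders):

# References an instance that already exists outside this configuration.
data "aws_instance" "existing" {
  instance_id = "i-0123456789abcdef0"
}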

You can then reference their attributes with the following syntax.
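
Resources are referenced as type.name.attribute, and data sources take a data. prefix. Using the placeholder names from above:

output "new_instance_id" {
  value = "${aws_instance.example.id}"
}

output "existing_instance_az" {
  value = "${data.aws_instance.existing.availability_zone}"
}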

Terraform’s capabilities extend far beyond these simple commands. Some common examples include importing environment variables, managing deployment modules, and storing infrastructure state remotely.

System Design

The goal of this system is to automagically move an object from one S3 bucket to two others. With minor changes, you could extend the lambda functions to complete any type of file processing before moving the object. The process involves performing the following:

  1. Notifying an SNS topic when a new object is placed in the S3 source bucket
  2. Triggering lambda functions
  3. Moving objects from the source to target S3 buckets

[Diagram: System architecture]

Environment Setup

This tutorial requires an AWS account with IAM creation privileges (free tier works) and the Terraform CLI. I recommend using Atom since it supports Terraform syntax highlighting. If you’re feeling particularly badass, you can also use vim. Darren Cheng, one of our Co-Founders, has a great starting point if you’re new to it.

By the end of this tutorial, the project should have the following structure:
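
Something along these lines works; the .tf file names are arbitrary, since Terraform reads every .tf file in a directory, and the lambda sources are assumed here to be Python:

.
├── .credentials
└── src
    ├── main.tf
    ├── buckets.tf
    ├── sns.tf
    ├── lambdas.tf
    ├── iam.tf
    └── lambda
        ├── move_to_target_one.py
        └── move_to_target_two.py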

Terraform treats a single directory as one configuration, so a resource declared in any file is accessible from every other file in that directory. However, before Terraform can manage resources, it needs to connect to a cloud provider. In the main configuration file, declare the provider.
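
A minimal provider block looks something like this; the region is only an example, and the credentials are read from environment variables, which we set up next:

provider "aws" {
  # Region is illustrative; access keys are picked up from the
  # AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
  region = "us-east-1"
}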

We’ll also store the access keys for connecting to AWS in a dedicated credentials dotfile. Create the following file, named .credentials, at the root of the project.
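
It only needs to export the two standard AWS environment variables; replace the placeholders with your own keys:

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>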

Next, load the keys into your environment and initialize the Terraform configuration:

$ source .credentials
$ cd src
$ terraform init

Creating Buckets

We’ll need to create one source bucket and two target buckets. Since the permissions to act on the buckets are provisioned separately, for now we only need to declare their existence.
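
A sketch with illustrative bucket names (S3 bucket names are global, so pick your own):

resource "aws_s3_bucket" "source" {
  bucket = "example-data-source-bucket"
}

resource "aws_s3_bucket" "target_one" {
  bucket = "example-data-target-bucket-one"
}

resource "aws_s3_bucket" "target_two" {
  bucket = "example-data-target-bucket-two"
}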

Creating the SNS Topic and Bucket Notification

The SNS topic will be used to fan out object-creation notifications to the lambda functions. Since the events originate from the source bucket, the topic also requires a policy that allows S3 to publish events triggered from that bucket. Here, we use the Amazon Resource Name (ARN) to identify the source bucket. The following block defines the policy and declares the SNS topic.
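
In this sketch the topic is named s3-object-created; that name and the resource labels are illustrative:

resource "aws_sns_topic" "object_created" {
  name = "s3-object-created"

  # The policy must name the topic's own ARN, which can't be referenced
  # from inside the resource itself, hence the hardcoded topic name below.
  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "s3.amazonaws.com" },
    "Action": "SNS:Publish",
    "Resource": "arn:aws:sns:*:*:s3-object-created",
    "Condition": {
      "ArnLike": { "aws:SourceArn": "${aws_s3_bucket.source.arn}" }
    }
  }]
}
POLICY
}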

Note: the topic name is hardcoded inside its own policy. Referencing it directly would be cleaner, but that would require the topic’s ARN inside its own definition, and Terraform doesn’t allow circular references.

Now, we can add a notification to publish each object creation event in the source bucket to the topic:
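
Using the same illustrative names:

resource "aws_s3_bucket_notification" "source_object_created" {
  bucket = "${aws_s3_bucket.source.id}"

  # Publish every object-creation event in the source bucket to the topic.
  topic {
    topic_arn = "${aws_sns_topic.object_created.arn}"
    events    = ["s3:ObjectCreated:*"]
  }
}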

Setting Up the Lambda Functions

Now all we have to do is write the lambda functions required for moving objects between buckets. There are four steps to completing this:

  1. Writing the source code
  2. Encoding the source code
  3. Declaring the lambda resources
  4. Establishing IAM roles

There are many different ways to deploy lambda source code to AWS, but one of the cleaner ways involves checking it in directly and letting Terraform archive and upload it. This approach especially helps with version control.

For the sake of focusing on the system architecture, these functions are oversimplified and almost exact duplicates of each other. They could easily be extended by filtering the types of objects sent to the topic and then reading and modifying the object.

After including the source code, we need to specify the mechanism for archiving it. This requires adding a new provider as well as data source placeholders:
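
The sketch below assumes the handlers are Python files in a lambda/ directory; adjust the paths and names to match your own sources:

provider "archive" {}

# Zips each handler so it can be uploaded as a lambda deployment package.
data "archive_file" "move_to_target_one" {
  type        = "zip"
  source_file = "${path.module}/lambda/move_to_target_one.py"
  output_path = "${path.module}/lambda/move_to_target_one.zip"
}

data "archive_file" "move_to_target_two" {
  type        = "zip"
  source_file = "${path.module}/lambda/move_to_target_two.py"
  output_path = "${path.module}/lambda/move_to_target_two.zip"
}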

Now, we can declare the lambda functions:
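
Continuing with the same assumptions (Python handlers, illustrative names); the IAM role referenced here is defined in the next step:

resource "aws_lambda_function" "move_to_target_one" {
  function_name    = "move-to-target-one"
  filename         = "${data.archive_file.move_to_target_one.output_path}"
  source_code_hash = "${data.archive_file.move_to_target_one.output_base64sha256}"
  handler          = "move_to_target_one.handler"
  runtime          = "python3.6"
  role             = "${aws_iam_role.lambda_mover.arn}"
}

resource "aws_lambda_function" "move_to_target_two" {
  function_name    = "move-to-target-two"
  filename         = "${data.archive_file.move_to_target_two.output_path}"
  source_code_hash = "${data.archive_file.move_to_target_two.output_base64sha256}"
  handler          = "move_to_target_two.handler"
  runtime          = "python3.6"
  role             = "${aws_iam_role.lambda_mover.arn}"
}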

Finally, the lambda functions require an IAM role for moving objects from the source bucket into the target buckets. This piece has very little documentation, but can cause a slew of issues when running the apply command since IAM roles and other permissions glue everything together. Terraform has excellent documentation on how to create resources, but information on how to connect them is scant. To successfully deploy a lambda function, you need to specify which AWS service the role is provisioned to use (the IAM role policy) and how the function can interact with other AWS services (the policy).

For this tutorial, the IAM role should allow the AWS Lambda service to assume it, and the policy should allow the functions to write logs, read objects from the source bucket, and place objects in the target buckets. The lambda functions reference the IAM role directly, while a separate role-policy resource attaches the policy to the role.

Note: the documentation demonstrates writing the permissions directly to the IAM resource, but specifying a data source is a little cleaner.
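
A sketch of all three pieces, with illustrative names (delete permission on the source bucket is included here because the functions move, rather than copy, objects):

resource "aws_iam_role" "lambda_mover" {
  name = "lambda-s3-mover"

  # Allows the Lambda service to assume this role.
  assume_role_policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
POLICY
}

# The permissions themselves, expressed as a data source.
data "aws_iam_policy_document" "lambda_mover" {
  statement {
    actions   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["arn:aws:logs:*:*:*"]
  }

  statement {
    actions   = ["s3:GetObject", "s3:DeleteObject"]
    resources = ["${aws_s3_bucket.source.arn}/*"]
  }

  statement {
    actions = ["s3:PutObject"]

    resources = [
      "${aws_s3_bucket.target_one.arn}/*",
      "${aws_s3_bucket.target_two.arn}/*",
    ]
  }
}

# Attaches the permissions to the role.
resource "aws_iam_role_policy" "lambda_mover" {
  name   = "lambda-s3-mover-policy"
  role   = "${aws_iam_role.lambda_mover.id}"
  policy = "${data.aws_iam_policy_document.lambda_mover.json}"
}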

Putting Everything Together

Now that the pieces are in place, we need to make sure that the lambda functions are privileged to read notifications from the topic. This requires allowing the lambdas to be invoked by SNS as well as establishing the topic subscriptions.
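
With the same illustrative names, the invocation permissions and topic subscriptions look like this:

# Lets SNS invoke each function when a message arrives on the topic.
resource "aws_lambda_permission" "sns_invoke_one" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = "${aws_lambda_function.move_to_target_one.function_name}"
  principal     = "sns.amazonaws.com"
  source_arn    = "${aws_sns_topic.object_created.arn}"
}

resource "aws_lambda_permission" "sns_invoke_two" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = "${aws_lambda_function.move_to_target_two.function_name}"
  principal     = "sns.amazonaws.com"
  source_arn    = "${aws_sns_topic.object_created.arn}"
}

# Subscribes each function to the object-creation topic.
resource "aws_sns_topic_subscription" "lambda_one" {
  topic_arn = "${aws_sns_topic.object_created.arn}"
  protocol  = "lambda"
  endpoint  = "${aws_lambda_function.move_to_target_one.arn}"
}

resource "aws_sns_topic_subscription" "lambda_two" {
  topic_arn = "${aws_sns_topic.object_created.arn}"
  protocol  = "lambda"
  endpoint  = "${aws_lambda_function.move_to_target_two.arn}"
}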

And that’s it! The newly created lambda functions will move objects from the source bucket to the target buckets. To verify, run the plan command and review the full execution plan before applying it, then drop a test object into the source bucket and confirm it lands in both targets. The source code in its entirety can be found here.

Questions? Feedback? Looking to join our team now that you know how we deploy infrastructure? Send an email to manav.kohli@thanx.com.
