How to load XML to Redshift from S3 using AWS Lambda

How to load XML to Redshift from S3 using AWS Lambda

🕑 Published : Sep 18, 2021 | 🕣 Updated : Sep 18, 2021 | ⏳ 7 Min read

📁AWS #️⃣Docker AWS

Table of Contents

In this article we will see how to load XML from an s3 bucket into Redshift using AWS Lambda and Redshift copy commands. AWS Lambda code is written in python. Code used in this article can be found here. Clone the git repo to get started.

Prerequisite

Download and install docker for your platform. Click here for instructions
Create AWS account for the demo. Click here for instructions

Create IAM user

We will see the steps to set up AWS SAM. Please skip next two sections before Create AWS resources if you have a machine with SAM configured and an IAM user with permissions required for SAM.

Create an IAM user in AWS with required access. Click here for more details about the required permissions for SAM. The IAM user used for this demo has following policies attached:
❎ IAMFullAccess
❎ AmazonS3FullAccess
❎ AmazonEventBridgeFullAccess
❎ AWSCloudFormationFullAccess
❎ AWSLambda_FullAccess
❎ AmazonEC2FullAccess
❎ AmazonRedshiftFullAccess

Access used for demo is pretty open, please adjust them as per your use case. Deployment should happen from your CICD pipeline and its natural for CICD user to have these elevated access.

Navigate to Security credentials tab of newly created IAM user and create an access key. The access key will be used by SAM to deploy the resources to AWS.

Start Docker containers

We will use SAM framework and localstack in docker for development and testing the python lambda.

Clone this repo
Open a new terminal and cd into aws-tools directory. This directory contains Dockerfile and docker-compose.yml to bring up a docker container with AWS SAM and localstack.
Create a copy of .env.template as .env
Update the .env with the access key from previous step
Start aws-tools container by running
```
docker-compose up -d
```
Validate the container by running
```
docker ps
```

Validate AWS CLI, SAM and Localstack

SSH into the container by running
```
docker exec -it aws-tools /bin/bash
```
Validate AWS CLI by running
```
aws --version
```
Validate AWS SAM by running
```
sam --version
```
Validate your AWS profile by running
```
aws configure list
```
You can also override the profile OR create a new profile at any point by running aws configure OR aws configure --profile <profile-name>

Validate localstack by running

curl http://localstack:4566 && echo -e ""

Create AWS resources

Let’s create following AWS resources required for this demo using AWS SAM.

✅ AWS Lambda to read xml from s3, convert xml to json and write json to s3
✅ S3 bucket and bucket policy
✅ Redshift cluster and Redshift role

SAM helps to create serverless application that you can package and deploy in AWS Cloud. AWS Lambda function to read xml from s3, convert xml to json and write json to s3 is written in python. Click the links below to review the code used

🔗 AWS Lambda(Python Module)
🔗 AWS SAM Template

SAM Build

cd in the SAM project directory sam-lambda-xml2json
```
cd /C/sam-lambda-xml2json/
```
cd into the SAM project directory and validate SAM project by running
```
sam validate
```
Build the SAM project by running
```
sam build
```

SAM Test locally

Commands in this section are executed from the aws-tools container. Alternatively, you can also SSH into the localstack container and run the CLI command using awslocal. In the next few steps awslocal commands is also captured for reference.

Create s3 bucket in localstack by running

aws --endpoint-url=http://localstack:4566 s3api create-bucket --bucket sam-lambda-xml2json --region us-east-1

localstack container command : awslocal s3 mb s3://sam-lambda-xml2json

Copy sample xml file to s3 bucket by running
```
aws --endpoint-url=http://localstack:4566 s3api put-object --bucket sam-lambda-xml2json  --key xml/breakfast_menu.xml --body sample-xml/breakfast_menu.xml
```
localstack container command : awslocal s3 cp sample-xml/breakfast_menu.xml s3://sam-lambda-xml2json/xml/breakfast_menu.xml
Make sure the s3 bucket in localstack has xml file by running
```
aws --endpoint-url=http://localstack:4566 s3api list-objects-v2 --bucket sam-lambda-xml2json
```
localstack container command : awslocal s3 ls s3://sam-lambda-xml2json/ --recursive --human-readable --summarize
Execute the lambda locally by running
```
sam local invoke --container-host host.docker.internal -e events/events.json
```
Here events.json is a sample event configured with sam-lambda-xml2json as bucket name and breakfast_menu.xml as key
Validate the generated JSON file by downloading it from localstack s3 bucket to local directory
```
aws --endpoint-url=http://localstack:4566 s3api get-object --bucket sam-lambda-xml2json --key json/breakfast_menu.json breakfast_menu.json
```
localstack container command : awslocal s3 cp s3://sam-lambda-xml2json/json/breakfast_menu.json breakfast_menu.json
Execute the lambda locally by passing parameter for environment. Lambda will try to read the s3 bucket in AWS and execution will fail gracefully when an environment other than local is passed. This is expected result
```
sam local invoke --container-host host.docker.internal -e events/events.json --parameter-overrides ParamEnvironment="dev"
```

SAM Deploy

Deploy SAM project by running below command. IAM user for SAM is required for this step
```
sam deploy --guided --capabilities CAPABILITY_NAMED_IAM
```
Guided deployment option is best for first deploy. Make sure to keep a note of parameters and the managed S3 bucket name so they can be used in future non guided deployments.

Future deploys can be done using below command

sam deploy --stack-name sam-lambda-xml2json --s3-bucket <sam-bucket-name> --capabilities CAPABILITY_NAMED_IAM --parameter-overrides ParamEnvironment="prod"

Validate AWS resources

Successful deployment will create the AWS resources. Navigate to Cloudformation in AWS console and confirm the status

Convert XML to JSON

Let’s upload some sample xml to the s3 bucket. The upload should trigger the lambda code to convert xml to JSON

Upload xml to s3 bucket into a directory named xml
This instantly creates a directory named json with the generated json files in them
Validate the lambda execution log by navigating to CloudWatch logs

Load JSON to Redshift

We will see couple of options to the JSON into a Redshift table. One using jsonpath which works well for simple json and the other one using SUPER data type which works well for nested/complex json. Choose the one which fits for your use case
Navigate to Amazon Redshift dashboard –> Query editor to get started
Click connect and run, when running a query for first time you can connect using a temporary credential

Using Copy and jsonpath

Create a table for copying the parsed JSON data

CREATE TABLE breakfast_menu (
  food_name VARCHAR,
  food_price VARCHAR,
  food_description VARCHAR
  );

Create a jsonpath for the JSON message and upload the same to jsonpath directory in same s3 bucket.

{
  "jsonpaths": [
      "$['breakfast_menu']['food'][0]['name']",
      "$['breakfast_menu']['food'][0]['price']",
      "$['breakfast_menu']['food'][0]['description']"
  ]
}

Copy JSON from S3 to Redshift table using jsonpath option. With this option, COPY uses the named JSONPaths file to map the data elements in the JSON source data to the columns in the target table.
```
COPY breakfast_menu
FROM 's3://sam-lambda-xml2json/json/' 
CREDENTIALS 'aws_iam_role=<iam-role-for-redshift>' 
json 's3://sam-lambda-xml2json/jsonpath/jsonpath_breakfast_menu.json';
```

Validate parsed JSON

SELECT *
FROM breakfast_menu
WHERE food_name IS NOT NULL;

As we see here json path requires array index which makes it little complex for nested JSON.

Using Copy and SUPER type

Create table with SUPER type.
```
CREATE TABLE raw_event (event SUPER);
```
Copy JSON from S3 to Redshift table using noshred option. Amazon Redshift doesn’t shred the attributes of JSON structures into multiple columns while loading a JSON document
```
COPY raw_event
FROM 's3://sam-lambda-xml2json/json/breakfast_menu.json' 
REGION 'us-east-1' 
IAM_ROLE '<iam-role-for-redshift>' 
FORMAT JSON 'noshred';
```
Validate raw event
```
SELECT re.*
FROM raw_event re;
```

Parse the JSON using bracket and dot notation. Bracket and dot notation works well for simple JSON just like jsonpath but not great for nested JSON.

SELECT event.breakfast_menu.food [0].name,
  event.breakfast_menu.food [0].price,
  event.breakfast_menu.food [0].description
  FROM raw_event;

Parse the nested JSON using PartiQL

SELECT bmf.name,
  bmf.price,
  bmf.description
  FROM raw_event re,
  	re.event.breakfast_menu.food bmf;

Parse and materialize the JSON into a a view.

CREATE MATERIALIZED VIEW breakfast_menu_view AUTO REFRESH YES AS
SELECT bmf.name,
  bmf.price,
  bmf.description
FROM raw_event re,
  re.event.breakfast_menu.food bmf;

SELECT * FROM breakfast_menu_view;

Clean Demo Resources

Empty the S3 bucket
Delete all AWS resources by running

aws cloudformation delete-stack --stack-name sam-lambda-xml2json

We saw how convert an XML into JSON using python Lambda function and load the JSON into redshift table using copy commands. The python code used here can be customized to run in your environment locally with some small changes.

Hope this was helpful. Did I miss something ? Let me know in the comments OR in the forum section.

References

Comments

How to pull data from API using AWS Lambda & stream to Snowflake using Snowpipe

In this article we will see how to get/pull data from an api, store in S3 and then stream the same data from S3 to …

Siva Nadesan

Sep 06, 2021 - 7 Min read

#Docker #Snowflake #AWS

AWS SAM in a Docker container

In this article we will see how to install AWS SAM in a docker container and to create an AWS Lambda using AWS SAM. …

Siva Nadesan

Sep 05, 2021 - 3 Min read

#Docker #AWS