preloader
blog-post

How to load XML to Redshift from S3 using AWS Lambda

Table of Contents

In this article we will see how to load XML from an s3 bucket into Redshift using AWS Lambda and Redshift copy commands. AWS Lambda code is written in python. Code used in this article can be found here. Clone the git repo to get started.

Prerequisite

  • Download and install docker for your platform. Click here for instructions
  • Create AWS account for the demo. Click here for instructions

Create IAM user

We will see the steps to set up AWS SAM. Please skip next two sections before Create AWS resources if you have a machine with SAM configured and an IAM user with permissions required for SAM.

  • Create an IAM user in AWS with required access. Click here for more details about the required permissions for SAM. The IAM user used for this demo has following policies attached:
    ❎ IAMFullAccess
    ❎ AmazonS3FullAccess
    ❎ AmazonEventBridgeFullAccess
    ❎ AWSCloudFormationFullAccess
    ❎ AWSLambda_FullAccess
    ❎ AmazonEC2FullAccess
    ❎ AmazonRedshiftFullAccess

Access used for demo is pretty open, please adjust them as per your use case. Deployment should happen from your CICD pipeline and its natural for CICD user to have these elevated access.

  • Navigate to Security credentials tab of newly created IAM user and create an access key. The access key will be used by SAM to deploy the resources to AWS.

Start Docker containers

We will use SAM framework and localstack in docker for development and testing the python lambda.

  • Clone this repo

  • Open a new terminal and cd into aws-tools directory. This directory contains Dockerfile and docker-compose.yml to bring up a docker container with AWS SAM and localstack.

  • Create a copy of .env.template as .env

  • Update the .env with the access key from previous step

  • Start aws-tools container by running

    docker-compose up -d
    
  • Validate the container by running

    docker ps
    

Validate AWS CLI, SAM and Localstack

  • SSH into the container by running

    docker exec -it aws-tools /bin/bash
    
  • Validate AWS CLI by running

    aws --version
    
  • Validate AWS SAM by running

    sam --version
    
  • Validate your AWS profile by running

    aws configure list
    

    You can also override the profile OR create a new profile at any point by running aws configure OR aws configure --profile <profile-name>

  • Validate localstack by running

    curl http://localstack:4566 && echo -e ""
    

Create AWS resources

Let’s create following AWS resources required for this demo using AWS SAM.

✅ AWS Lambda to read xml from s3, convert xml to json and write json to s3
✅ S3 bucket and bucket policy
✅ Redshift cluster and Redshift role

SAM helps to create serverless application that you can package and deploy in AWS Cloud. AWS Lambda function to read xml from s3, convert xml to json and write json to s3 is written in python. Click the links below to review the code used

🔗 AWS Lambda(Python Module)
🔗 AWS SAM Template

SAM Build

  • cd in the SAM project directory sam-lambda-xml2json

    cd /C/sam-lambda-xml2json/
    
  • cd into the SAM project directory and validate SAM project by running

    sam validate
    
  • Build the SAM project by running

    sam build
    

SAM Test locally

Commands in this section are executed from the aws-tools container. Alternatively, you can also SSH into the localstack container and run the CLI command using awslocal. In the next few steps awslocal commands is also captured for reference.

  • Create s3 bucket in localstack by running

    aws --endpoint-url=http://localstack:4566 s3api create-bucket --bucket sam-lambda-xml2json --region us-east-1
    

    localstack container command : awslocal s3 mb s3://sam-lambda-xml2json

  • Copy sample xml file to s3 bucket by running

    aws --endpoint-url=http://localstack:4566 s3api put-object --bucket sam-lambda-xml2json  --key xml/breakfast_menu.xml --body sample-xml/breakfast_menu.xml
    

    localstack container command : awslocal s3 cp sample-xml/breakfast_menu.xml s3://sam-lambda-xml2json/xml/breakfast_menu.xml

  • Make sure the s3 bucket in localstack has xml file by running

    aws --endpoint-url=http://localstack:4566 s3api list-objects-v2 --bucket sam-lambda-xml2json
    

    localstack container command : awslocal s3 ls s3://sam-lambda-xml2json/ --recursive --human-readable --summarize

  • Execute the lambda locally by running

    sam local invoke --container-host host.docker.internal -e events/events.json
    

    Here events.json is a sample event configured with sam-lambda-xml2json as bucket name and breakfast_menu.xml as key

  • Validate the generated JSON file by downloading it from localstack s3 bucket to local directory

    aws --endpoint-url=http://localstack:4566 s3api get-object --bucket sam-lambda-xml2json --key json/breakfast_menu.json breakfast_menu.json
    

    localstack container command : awslocal s3 cp s3://sam-lambda-xml2json/json/breakfast_menu.json breakfast_menu.json

  • Execute the lambda locally by passing parameter for environment. Lambda will try to read the s3 bucket in AWS and execution will fail gracefully when an environment other than local is passed. This is expected result

    sam local invoke --container-host host.docker.internal -e events/events.json --parameter-overrides ParamEnvironment="dev"
    

SAM Deploy

  • Deploy SAM project by running below command. IAM user for SAM is required for this step

    sam deploy --guided --capabilities CAPABILITY_NAMED_IAM
    

    Guided deployment option is best for first deploy. Make sure to keep a note of parameters and the managed S3 bucket name so they can be used in future non guided deployments.

  • Future deploys can be done using below command

    sam deploy --stack-name sam-lambda-xml2json --s3-bucket <sam-bucket-name> --capabilities CAPABILITY_NAMED_IAM --parameter-overrides ParamEnvironment="prod"
    

Validate AWS resources

Successful deployment will create the AWS resources. Navigate to Cloudformation in AWS console and confirm the status

Convert XML to JSON

Let’s upload some sample xml to the s3 bucket. The upload should trigger the lambda code to convert xml to JSON

  • Upload xml to s3 bucket into a directory named xml

  • This instantly creates a directory named json with the generated json files in them

  • Validate the lambda execution log by navigating to CloudWatch logs

Load JSON to Redshift

  • We will see couple of options to the JSON into a Redshift table. One using jsonpath which works well for simple json and the other one using SUPER data type which works well for nested/complex json. Choose the one which fits for your use case

  • Navigate to Amazon Redshift dashboard –> Query editor to get started

  • Click connect and run, when running a query for first time you can connect using a temporary credential

Using Copy and jsonpath

  • Create a table for copying the parsed JSON data

    CREATE TABLE breakfast_menu (
      food_name VARCHAR,
      food_price VARCHAR,
      food_description VARCHAR
      );
    
  • Create a jsonpath for the JSON message and upload the same to jsonpath directory in same s3 bucket.

    {
      "jsonpaths": [
          "$['breakfast_menu']['food'][0]['name']",
          "$['breakfast_menu']['food'][0]['price']",
          "$['breakfast_menu']['food'][0]['description']"
      ]
    }
    
  • Copy JSON from S3 to Redshift table using jsonpath option. With this option, COPY uses the named JSONPaths file to map the data elements in the JSON source data to the columns in the target table.

    COPY breakfast_menu
    FROM 's3://sam-lambda-xml2json/json/' 
    CREDENTIALS 'aws_iam_role=<iam-role-for-redshift>' 
    json 's3://sam-lambda-xml2json/jsonpath/jsonpath_breakfast_menu.json';
    
  • Validate parsed JSON

    SELECT *
    FROM breakfast_menu
    WHERE food_name IS NOT NULL;
    

As we see here json path requires array index which makes it little complex for nested JSON.

Using Copy and SUPER type

  • Create table with SUPER type.

    CREATE TABLE raw_event (event SUPER);
    
  • Copy JSON from S3 to Redshift table using noshred option. Amazon Redshift doesn’t shred the attributes of JSON structures into multiple columns while loading a JSON document

    COPY raw_event
    FROM 's3://sam-lambda-xml2json/json/breakfast_menu.json' 
    REGION 'us-east-1' 
    IAM_ROLE '<iam-role-for-redshift>' 
    FORMAT JSON 'noshred';
    
  • Validate raw event

    SELECT re.*
    FROM raw_event re;
    
  • Parse the JSON using bracket and dot notation. Bracket and dot notation works well for simple JSON just like jsonpath but not great for nested JSON.

    SELECT event.breakfast_menu.food [0].name,
      event.breakfast_menu.food [0].price,
      event.breakfast_menu.food [0].description
      FROM raw_event;
    
  • Parse the nested JSON using PartiQL

    SELECT bmf.name,
      bmf.price,
      bmf.description
      FROM raw_event re,
      	re.event.breakfast_menu.food bmf;
    
  • Parse and materialize the JSON into a a view.

    CREATE MATERIALIZED VIEW breakfast_menu_view AUTO REFRESH YES AS
    SELECT bmf.name,
      bmf.price,
      bmf.description
    FROM raw_event re,
      re.event.breakfast_menu.food bmf;
    
    SELECT * FROM breakfast_menu_view;
    

Clean Demo Resources

  • Empty the S3 bucket

  • Delete all AWS resources by running

aws cloudformation delete-stack --stack-name sam-lambda-xml2json

We saw how convert an XML into JSON using python Lambda function and load the JSON into redshift table using copy commands. The python code used here can be customized to run in your environment locally with some small changes.

Hope this was helpful. Did I miss something ? Let me know in the comments OR in the forum section.

References

Share this blog:
Comments

Related Articles