Exploring IBM DataStage Next Generation / IBM DataStage on Cloud Pak


DataStage · Published 10 Oct 2021 · Last updated 10 Oct 2021 · Siva Nadesan

Overview

DataStage Next Gen is here now. It is a fully managed, cloud-ready data integration SaaS offering from IBM, built on containers and microservices. The redesigned DataStage architecture no longer relies on XMETA or WebSphere Application Server (WAS) for metadata, nor on a Windows client for developing jobs. The Common Asset Management Infrastructure (CAMS) is the underlying repository for managing DataStage flows, jobs and parameter sets, and it is queryable via APIs and the CLI. Data flow job development is done in a new web canvas built using an open-source canvas project. This is a rundown of my exploration of this new and improved version of DataStage.
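
Because the repository is exposed through REST APIs, you can inspect projects, flows and jobs programmatically rather than only through the UI. Below is a minimal Python sketch that exchanges an IBM Cloud API key for a bearer token and lists the projects visible to it; the host name and response fields reflect my reading of the public Watson Data API, so treat them as assumptions and verify against the current IBM documentation.

```python
# Minimal sketch (not an official example): get an IAM bearer token and list
# Cloud Pak for Data as a Service projects over the Watson Data API.
# The API host below is the Dallas region; adjust it to the region you signed up in.
import os
import requests

IAM_URL = "https://iam.cloud.ibm.com/identity/token"
API_HOST = "https://api.dataplatform.cloud.ibm.com"  # assumption: Dallas region host


def get_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for a short-lived bearer token."""
    resp = requests.post(
        IAM_URL,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


token = get_token(os.environ["IBM_CLOUD_API_KEY"])

# List projects; the "resources"/"metadata"/"entity" response shape is an
# assumption based on the public Watson Data API docs.
resp = requests.get(f"{API_HOST}/v2/projects", headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for project in resp.json().get("resources", []):
    print(project["metadata"]["guid"], project["entity"]["name"])
```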

Sign up

To get started with IBM DataStage on Cloud Pak (aka IBM DataStage Next Gen), you will need an IBM account.

  • Pick the region closest to you for hosting the services and data
  • Create a new account if you don’t have an IBM account, or log in to your existing account if you already have one.
  • Once you log in to your IBM account, you will be prompted to select the services you want in your IBM Cloud Pak for Data. The Cloud Object Storage and DataStage services are selected by default; Watson services can additionally be selected if required. Since we are here for DataStage, we will not change the default selection. Review the selection and click Continue.
  • IBM Cloud Pak for Data will be provisioned in a few seconds (approximately 30 seconds). Click on “Go to IBM Cloud Pak for Data”.
  • This should take you to the welcome page for a guided tour. Take the tour to familiarize yourself with the IBM Cloud Pak for Data UI.

Create project

Once you are logged in to your IBM Cloud Pak for Data portal, the next step is to create a project to host the DataStage assets. Each project has an associated Cloud Object Storage area for storing job logs, flat files and other artifacts that require a file system.

  • Click on “Create a project” on the home page for IBM Cloud Pak for Data
  • You will be presented with options to create a project. Click on “Create an empty project”
  • Enter a preferred name for your project and click on “Create”
  • The project will be created shortly and you will be directed to the project overview page. The overview page has:
    ✅ A README section to add documentation for the project using Markdown
    ✅ An Assets section to manage the connections and DataStage flows
    ✅ A Jobs section to manage and review the execution of DataStage flows; here you can also change the execution environment and the job schedule information (a sketch of listing these jobs over the API follows this list)
    ✅ An Access section to manage access to the project
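
As a follow-up to the Jobs section above, here is a minimal sketch of listing a project's jobs over the Watson Data API. The /v2/jobs path, the project_id query parameter and the response shape are assumptions based on my reading of the public docs; the bearer token is obtained the same way as in the earlier snippet.

```python
# Minimal sketch (assumed endpoint): list the jobs defined in a project.
# CPD_BEARER_TOKEN holds an IAM bearer token (see the earlier snippet) and
# CPD_PROJECT_ID is the project's GUID, e.g. taken from the project URL.
import os
import requests

API_HOST = "https://api.dataplatform.cloud.ibm.com"  # adjust for your region

resp = requests.get(
    f"{API_HOST}/v2/jobs",
    params={"project_id": os.environ["CPD_PROJECT_ID"]},
    headers={"Authorization": f"Bearer {os.environ['CPD_BEARER_TOKEN']}"},
)
resp.raise_for_status()
for job in resp.json().get("results", []):
    # The "results"/"metadata" keys are assumptions; print whatever names come back.
    print(job.get("metadata", {}).get("name"), job.get("metadata", {}).get("asset_id"))
```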

Create project assets

In this demo, we will source data from Snowflake, process it in DataStage and create a flat file. In the next few steps we will create the connection to Snowflake and the DataStage flow for processing the data.

Everything demoed here could be done in Snowflake itself, but the idea of the demo is to show the usage of DataStage with an example.

Create Connection

  • On your project overview page, click on “Add to Project” and select “Connection”
  • Select Snowflake as the connection type and fill in the details. Make sure to use the Snowflake account name with the region and leave the secure gateway unchecked for the demo.

    🔗 Snowflake - DataStage user setup

  • Validate the connection details by clicking on “Test connection”. Click on “Create” to save the connection details after the test is successful (if the test fails, a quick way to check the same credentials outside DataStage is sketched after this list).
  • All connections in the project will be listed in the “Assets” -> “Data assets” section of the project.
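
If the connection test fails, it helps to rule out the credentials themselves. The sketch below, which uses the Snowflake Python connector rather than anything DataStage-specific, checks the same account, user and password details outside of Cloud Pak; the warehouse, database and schema names are placeholders for whatever your DataStage user was set up with.

```python
# Optional sanity check outside DataStage: verify the Snowflake credentials with
# the official Python connector (pip install snowflake-connector-python).
# Warehouse/database/schema names below are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],   # account name including region, e.g. xy12345.us-east-1
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="DEMO_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_WAREHOUSE()")
    print(cur.fetchone())
finally:
    conn.close()
```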

Create DataStage flow

  • On your project overview page, click on “Add to Project” and select “DataStage flow”
  • Give an appropriate name for the DataStage flow and click on “Create”
  • From the palette, add Snowflake, Sequential file and Transformer stages. Connect the stages as shown below
  • One of the nice features of DataStage Next Gen is that you no longer have to drag and drop the column definitions into each stage; they are auto-propagated to the next stage. We will add a simple constraint and derivation in the output of the Transformer stage as shown below (a conceptual Python equivalent is sketched after this list).
  • In the output sequential file stage, update the target file name
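
For readers who have not used the Transformer stage before, the sketch below shows in plain Python (pandas) what a constraint plus a derivation amount to conceptually: filter the incoming rows, derive a new output column, and write the result to a flat file. The column names and expressions are hypothetical; the real ones for this demo live in the stage editors shown in the screenshots.

```python
# Purely illustrative pandas equivalent of the DataStage flow: Snowflake source ->
# Transformer (constraint + derivation) -> Sequential file. Column names are hypothetical.
import pandas as pd

# Stand-in for the rows read from the Snowflake connector.
rows = pd.DataFrame({
    "ORDER_ID": [1, 2, 3],
    "ORDER_AMOUNT": [120.0, 35.5, 980.0],
    "ORDER_DATE": pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"]),
})

# Constraint: only rows satisfying the condition pass through the Transformer.
passed = rows[rows["ORDER_AMOUNT"] > 100]

# Derivation: compute a new output column from the input columns.
passed = passed.assign(ORDER_YEAR=passed["ORDER_DATE"].dt.year)

# Sequential file stage: write the output rows to a flat file.
passed.to_csv("orders_out.csv", index=False)
```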

Execute DataStage flow

  • At this point we are ready to save, compile and execute the job. The buttons to save, compile and run the job are located in the top left corner of your DataStage flow screen, just like in traditional DataStage.
  • You can see the execution status and job logs by clicking the “Logs” button in the top right corner of your DataStage flow screen.
  • The job logs and the output files are available in IBM Cloud Object Storage. You can get to the Object Storage instance for your project from the settings page of the Cloud Pak project (a sketch of fetching the output file programmatically follows this list).
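
If you prefer to fetch the output file without going through the UI, the sketch below uses the IBM COS SDK (ibm-cos-sdk, an S3-compatible client) to list and download objects from the project's bucket. The endpoint URL, bucket name and object key are placeholders; take the real values from the bucket details on the project's settings page.

```python
# Minimal sketch: list and download the job output from the project's Cloud Object
# Storage bucket (pip install ibm-cos-sdk). Endpoint, bucket and key are placeholders.
import os
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id=os.environ["IBM_CLOUD_API_KEY"],
    ibm_service_instance_id=os.environ["COS_INSTANCE_CRN"],
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",  # placeholder region endpoint
)

BUCKET = "my-datastage-project-bucket"  # placeholder: the bucket created for the project

for obj in cos.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the flat file written by the Sequential file stage (placeholder key).
cos.download_file(Bucket=BUCKET, Key="orders_out.csv", Filename="orders_out.csv")
```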

We created a DataStage project and a DataStage flow in a few minutes. Prior to DataStage Next Gen, creating a DataStage environment would take hours of effort, and unless you were working in an organization that uses DataStage, getting hands-on time with DataStage was not possible. With DataStage Next Gen and the IBM DataStage Lite plan (aka the free tier), you can have your DataStage environment up and running in a few minutes, with a place to try the new features that DataStage Next Gen has to offer. This is a good start, and I am hoping IBM DataStage will continue to evolve at a good pace and add more stages and connectors to the ecosystem.

Hope this was helpful. Did I miss something? Let me know in the forum section and I’ll add it in!

References

Exploring DataStage Next Generation



About the Author
Siva Nadesan

Siva Nadesan is a Principal Data Engineer. His passions include working on data engineering and writing technical blogs. He likes to learn new technologies and apply his knowledge to build solutions for real-world problems.
