Exploring IBM DataStage Next Generation / IBM DataStage on Cloud Pak


DataStage · Published 10 Oct 2021 · Last updated 10 Oct 2021 · Siva Nadesan

Overview

DataStage Next Gen is here now. It is a fully managed, cloud-ready data integration SaaS offering from IBM, built on containers and microservices. The redesigned DataStage architecture no longer relies on XMETA or WebSphere Application Server (WAS) for metadata, nor on a Windows client for developing jobs. The Common Asset Management Infrastructure (CAMS) is the underlying repository for managing DataStage flows, jobs and parameter sets, and it is queryable via APIs and the CLI. Data flow job development is done in a new web canvas built using an open-source canvas project. This is a rundown of my exploration of this new and improved version of DataStage.
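
Because the repository is exposed through REST APIs, you can inspect projects, flows and jobs programmatically rather than only through the UI. Below is a minimal Python sketch that exchanges an IBM Cloud API key for a bearer token and lists the projects visible to it; the host name and response fields reflect my reading of the public Watson Data API, so treat them as assumptions and verify against the current IBM documentation.

```python
# Minimal sketch (not an official example): get an IAM bearer token and list
# Cloud Pak for Data as a Service projects over the Watson Data API.
# The API host below is the Dallas region; adjust it to the region you signed up in.
import os
import requests

IAM_URL = "https://iam.cloud.ibm.com/identity/token"
API_HOST = "https://api.dataplatform.cloud.ibm.com"  # assumption: Dallas region host


def get_token(api_key: str) -> str:
    """Exchange an IBM Cloud API key for a short-lived bearer token."""
    resp = requests.post(
        IAM_URL,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        data={"grant_type": "urn:ibm:params:oauth:grant-type:apikey", "apikey": api_key},
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


token = get_token(os.environ["IBM_CLOUD_API_KEY"])

# List projects; the "resources"/"metadata"/"entity" response shape is an
# assumption based on the public Watson Data API docs.
resp = requests.get(f"{API_HOST}/v2/projects", headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
for project in resp.json().get("resources", []):
    print(project["metadata"]["guid"], project["entity"]["name"])
```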

Sign up

To get started with IBM DataStage on Cloud Pak (aka IBM DataStage Next Gen), you will need an IBM account.

  • Pick the region closest to you for hosting the services and data
  • Create a new account if you don’t have an IBM account, or log in to your existing account if you already have one.
  • Once you log in to your IBM account, you will be prompted to select the services you want in your IBM Cloud Pak for Data. The Cloud Object Storage and DataStage services are selected by default; Watson services can additionally be selected if required. Since we are here for DataStage, we will not change the default selection. Review the selection and click Continue.
  • IBM Cloud Pak for Data will be provisioned in a few seconds (approximately 30 seconds). Click on “Go to IBM Cloud Pak for Data”.
  • This should take you to the welcome page for a guided tour. Take the tour to familiarize yourself with the IBM Cloud Pak for Data UI.

Create project

Once you are logged in to your IBM Cloud Pak for Data portal, the next step is to create a project to host the DataStage assets. Each project has an associated Cloud Object Storage area for storing job logs, flat files and other artifacts that require a file system.

  • Click on “Create a project” on the home page for IBM Cloud Pak for Data
  • You will be presented with options to create a project. Click on “Create an empty project”
  • Enter a preferred name for your project and click on “Create”
  • The project will be created shortly and you will be directed to the project overview page. The overview page has:
    ✅ A README section to add documentation for the project using Markdown
    ✅ An Assets section to manage the connections and DataStage flows
    ✅ A Jobs section to manage and review the execution of DataStage flows; here you can also change the execution environment and the job schedule information (a sketch of listing these jobs over the API follows this list)
    ✅ An Access section to manage access to the project
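
As a follow-up to the Jobs section above, here is a minimal sketch of listing a project's jobs over the Watson Data API. The /v2/jobs path, the project_id query parameter and the response shape are assumptions based on my reading of the public docs; the bearer token is obtained the same way as in the earlier snippet.

```python
# Minimal sketch (assumed endpoint): list the jobs defined in a project.
# CPD_BEARER_TOKEN holds an IAM bearer token (see the earlier snippet) and
# CPD_PROJECT_ID is the project's GUID, e.g. taken from the project URL.
import os
import requests

API_HOST = "https://api.dataplatform.cloud.ibm.com"  # adjust for your region

resp = requests.get(
    f"{API_HOST}/v2/jobs",
    params={"project_id": os.environ["CPD_PROJECT_ID"]},
    headers={"Authorization": f"Bearer {os.environ['CPD_BEARER_TOKEN']}"},
)
resp.raise_for_status()
for job in resp.json().get("results", []):
    # The "results"/"metadata" keys are assumptions; print whatever names come back.
    print(job.get("metadata", {}).get("name"), job.get("metadata", {}).get("asset_id"))
```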

Create project assets

In this demo, we will source data from Snowflake, process it in DataStage and create a flat file. In the next few steps we will create the connection to Snowflake and the DataStage flow for processing the data.

Everything demoed here could be done in Snowflake itself, but the idea of the demo is to show the usage of DataStage with an example.

Create Connection

  • On your project overview page, click on “Add to Project” and select “Connection”
  • Select Snowflake as the connection type and fill in the details. Make sure to use the Snowflake account name with the region and leave the secure gateway unchecked for the demo.

    🔗 Snowflake - DataStage user setup

  • Validate the connection details by clicking on “Test connection”. Click on “Create” to save the connection details after the test is successful (if the test fails, a quick way to check the same credentials outside DataStage is sketched after this list).
  • All connections in the project will be listed in the “Assets” -> “Data assets” section of the project.
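
If the connection test fails, it helps to rule out the credentials themselves. The sketch below, which uses the Snowflake Python connector rather than anything DataStage-specific, checks the same account, user and password details outside of Cloud Pak; the warehouse, database and schema names are placeholders for whatever your DataStage user was set up with.

```python
# Optional sanity check outside DataStage: verify the Snowflake credentials with
# the official Python connector (pip install snowflake-connector-python).
# Warehouse/database/schema names below are placeholders.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],   # account name including region, e.g. xy12345.us-east-1
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="DEMO_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_WAREHOUSE()")
    print(cur.fetchone())
finally:
    conn.close()
```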

Create DataStage flow

  • On your project overview page, click on “Add to Project” and select “DataStage flow”
  • Give an appropriate name for the DataStage flow and click on “Create”
  • From the palette, add Snowflake, Sequential file and Transformer stages. Connect the stages as shown below
  • One of the nice features of DataStage Next Gen is that you no longer have to drag and drop the column definitions into each stage; they are auto-propagated to the next stage. We will add a simple constraint and derivation in the output of the Transformer stage as shown below (a conceptual Python equivalent is sketched after this list).
  • In the output sequential file stage, update the target file name
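
For readers who have not used the Transformer stage before, the sketch below shows in plain Python (pandas) what a constraint plus a derivation amount to conceptually: filter the incoming rows, derive a new output column, and write the result to a flat file. The column names and expressions are hypothetical; the real ones for this demo live in the stage editors shown in the screenshots.

```python
# Purely illustrative pandas equivalent of the DataStage flow: Snowflake source ->
# Transformer (constraint + derivation) -> Sequential file. Column names are hypothetical.
import pandas as pd

# Stand-in for the rows read from the Snowflake connector.
rows = pd.DataFrame({
    "ORDER_ID": [1, 2, 3],
    "ORDER_AMOUNT": [120.0, 35.5, 980.0],
    "ORDER_DATE": pd.to_datetime(["2021-10-01", "2021-10-02", "2021-10-03"]),
})

# Constraint: only rows satisfying the condition pass through the Transformer.
passed = rows[rows["ORDER_AMOUNT"] > 100]

# Derivation: compute a new output column from the input columns.
passed = passed.assign(ORDER_YEAR=passed["ORDER_DATE"].dt.year)

# Sequential file stage: write the output rows to a flat file.
passed.to_csv("orders_out.csv", index=False)
```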

Execute DataStage flow

  • At this point we are ready to save, compile and execute the job. The buttons to save, compile and run the job are located in the top left corner of your DataStage flow screen, just like in traditional DataStage.
  • You can see the execution status and job logs by clicking the “Logs” button in the top right corner of your DataStage flow screen.
  • The job logs and the output files are available in IBM Cloud Object Storage. You can get to the Object Storage instance for your project from the settings page of the Cloud Pak project (a sketch of fetching the output file programmatically follows this list).
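
If you prefer to fetch the output file without going through the UI, the sketch below uses the IBM COS SDK (ibm-cos-sdk, an S3-compatible client) to list and download objects from the project's bucket. The endpoint URL, bucket name and object key are placeholders; take the real values from the bucket details on the project's settings page.

```python
# Minimal sketch: list and download the job output from the project's Cloud Object
# Storage bucket (pip install ibm-cos-sdk). Endpoint, bucket and key are placeholders.
import os
import ibm_boto3
from ibm_botocore.client import Config

cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id=os.environ["IBM_CLOUD_API_KEY"],
    ibm_service_instance_id=os.environ["COS_INSTANCE_CRN"],
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",  # placeholder region endpoint
)

BUCKET = "my-datastage-project-bucket"  # placeholder: the bucket created for the project

for obj in cos.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the flat file written by the Sequential file stage (placeholder key).
cos.download_file(Bucket=BUCKET, Key="orders_out.csv", Filename="orders_out.csv")
```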

We created a DataStage project and a DataStage flow in a few minutes. Prior to DataStage Next Gen, creating a DataStage environment would take hours of effort, and unless you were working in an organization that uses DataStage, getting hands-on time with DataStage was not possible. With DataStage Next Gen and the IBM DataStage Lite plan (aka the free tier), you can have your DataStage environment up and running in a few minutes, with a place to try the new features that DataStage Next Gen has to offer. This is a good start, and I am hoping IBM DataStage will continue to evolve at a good pace and add more stages and connectors to the ecosystem.

Hope this was helpful. Did I miss something? Let me know in the forum section and I’ll add it in!

References

Exploring DataStage Next Generation



About the Author
Siva Nadesan

Siva Nadesan is a Principal Data Engineer. His passions include working on data engineering and writing technical blogs. He likes to learn new technologies and apply his knowledge to build solutions for real-world problems.
