What is DataStage?
Datastage is an ETL (Extract, Transform and Load) tool that is part of the IBM Infosphere suite. It is used to build and maintain large data warehouses and data marts: with it we design, develop and execute jobs that fill the tables of a data warehouse or data mart. The engine runs on Windows and UNIX servers, extracts data from source databases and transforms it for loading into the warehouse. It has become an essential part of the IBM WebSphere Data Integration suite.
What is Data partitioning?
Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance. When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at runtime, InfoSphere DataStage applies that algorithm across the degree of parallelism, which is specified dynamically at run time through the configuration file.
What are the important features of Datastage?
Datastage is used to perform the ETL operations (Extract, transform, load)
Datastage is the data integration component of IBM Infosphere information server.
Datastage is a GUI-based tool. We just need to drag and drop the Datastage objects, and the design is converted into Datastage code.
Datastage provides connectivity to multiple sources & multiple targets at the same time
It provides partitioning and parallel processing techniques that enable Datastage jobs to process huge volumes of data much faster.
It has enterprise-level connectivity.
How is a source file populated?
We can populate a source file in many ways, such as by creating a SQL query in Oracle or by using the Row Generator stage, etc.
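For example, a simple query along these lines (the table and column names are placeholders) could be run in Oracle and spooled to a flat file to serve as the source:
SELECT customer_id, customer_name, created_date FROM customers;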
What is usage analysis in DataStage?
If you want to know whether a certain job is part of a sequence, right-click the job in the Manager and then choose Usage Analysis.
What is the command line function to import and export the DS jobs?
To import the DS jobs, dsimport.exe is used and to export the DS jobs, dsexport.exe is used.
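Illustrative invocations are shown below; the domain, host, user, project and file names are placeholders and the exact options vary between versions, so check the documentation of your release:
dsimport.exe /D=domain:9080 /H=enginehost /U=dsadm /P=password MyProject C:\exports\MyJob.dsx
dsexport.exe /D=domain:9080 /H=enginehost /U=dsadm /P=password /JOB=MyJob MyProject C:\exports\MyJob.dsx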
What is the difference between Datastage 7.5 and 7.0?
In Datastage 7.5 many new stages were added for more robustness and smoother performance, such as the Procedure stage, the Command stage, Generate Report, etc.
What are some differences between the 7.x and 8.x versions of DataStage?
Version of 7.X:
- It is platform dependent
- It has a 2-tier architecture where Datastage is built on top of a UNIX server
- There is no concept of a parameter set
- Designer and Manager were two separate clients
- We had to manually search for jobs in this version
Version of 8.X:
- It is platform independent
- It has a 3-tier architecture: the UNIX server database at the bottom, then the XMETA database which acts as the repository, and then Datastage on top
- We have parameter sets which can be used anywhere in the project
- In this version, the Manager client was merged into the Designer client
- We have a quick find option in the repository where we can easily search for jobs
How can you fix the truncated data error in Datastage?
The truncated data error can be fixed by using the environment variable IMPORT_REJECT_STRING_FIELD_OVERRUN.
Can you define Merge?
Merge means to join two or more tables. The tables are joined on the basis of the primary key columns in both tables.
What is the difference between a data file and a descriptor file?
As the names imply, data files contain the data, and the descriptor file contains the description/information about the data in the data files.
What is the difference between an Operational DataStage and a Data Warehouse?
An Operational DataStage can be considered a staging area for real-time analysis and user processing; thus it is a temporary repository, whereas the data warehouse is used for long-term data storage and holds the complete data of the entire business.
What is the difference between validated and Compiled in the Datastage?
In Datastage, validating a job means executing it in a validation run: while validating, the Datastage engine verifies whether all the required properties have been provided. While compiling a job, on the other hand, the engine verifies whether the given properties are valid.
How to manage date conversion in Datastage?
We can use the date conversion functions for this purpose, i.e. Oconv(Iconv(FieldName, "Existing Date Format"), "Another Date Format").
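For instance, a sketch that converts a YYYY/MM/DD string into DD-MON-YYYY (the conversion codes are illustrative and must match the actual source format):
Oconv(Iconv(InputDate, "D/YMD[4,2,2]"), "D-DMY[2,A3,4]")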
Why do we use exception activity in Datastage?
All the stages after the Exception activity in Datastage are executed when any unknown error occurs while the job sequencer is running.
What is APT_CONFIG in Datastage?
It is the environment variable that is used to identify the *.apt configuration file in Datastage. The configuration file stores the node information, disk storage information and scratch space information.
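A minimal configuration file, as a rough sketch (the node name, host name and directory paths are placeholders), looks like this:
{
  node "node1" {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
}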
What are the different types of Lookups in Datastage?
There are two types of Lookups in Datastage, i.e. Normal lkp and Sparse lkp. In a Normal lkp, the reference data is loaded into memory first and then the lookup is performed. In a Sparse lkp, the lookup query is fired directly against the database for each input row. A Normal lkp is therefore usually faster; a Sparse lkp pays off only when the reference table is much larger than the stream of input rows.
How can a server job be converted to a parallel job?
We can convert a server job into a parallel job by using the IPC stage and the Link Collector stage.
Can you explain Repository tables in Datastage?
In Datastage, the Repository is another name for a data warehouse. It can be centralized as well as distributed.
What are Routines?
Routines are basically collections of functions that are defined in the DS Manager and can be called from a Transformer stage. There are three types of routines: parallel routines, mainframe routines and server routines.
How can you write parallel routines in Datastage PX?
We can write parallel routines in C or C++ and compile them into an object or shared library. Such routines are also defined in the DS Manager and can be called from a Transformer stage.
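As a rough sketch of what such a routine can look like (the function name and argument list are assumptions; the code is compiled into a shared library and then registered in the Designer as an External Function parallel routine):
extern "C" {
    /* Returns the length of the input string, treating a NULL pointer as an
       empty string. Real routines use whatever argument and return types are
       declared for them when the routine is registered. */
    int string_length(char *value)
    {
        int len = 0;
        if (value != 0) {
            while (value[len] != '\0') {
                ++len;
            }
        }
        return len;
    }
}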
What is the method of removing duplicates, without the remove duplicate stage?
Duplicates can be removed by using the Sort stage, where we can set the option Allow Duplicates = False.
What are the different options associated with dsjob command?
The dsjob command (run as, for example, $ dsjob -run) supports options such as the following; a usage example follows the list:
- run: To run a job
- stop: To stop a running job
- lprojects: To list the projects
- ljobs: To list the jobs in a project
- lstages: To list the stages present in a job
- llinks: To list the links
- projectinfo: Returns the project information (host name and project name)
- jobinfo: Returns the job information (job status, job run time, end time, etc.)
- stageinfo: Returns the stage information (stage name, stage type, input rows, etc.)
- linkinfo: Returns the link information
- lparams: To list the parameters in a job
- paraminfo: Returns the parameter information
- log: To add a text message to the log
- logsum: To display the log
- logdetail: To display the log with details such as event id, time and message
- lognewest: To display the newest log id
- report: To display a report containing the generated time, start time, elapsed time, status, etc.
- jobid: Returns the job id information
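For example (the project and job names are placeholders):
$ dsjob -run -jobstatus MyProject MyJob
$ dsjob -logsum MyProject MyJob
The first command runs the job and waits so that its exit code reflects the finishing status; the second displays a summary of the job log.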
What steps should be taken to improve Datastage jobs?
In order to improve the performance of Datastage jobs, we have to first establish baselines. Secondly, we should not use only one flow for performance testing. Thirdly, we should work in increments. Then, we should evaluate data skew. Then we should isolate and solve the problems, one by one. After that, we should distribute the file systems to remove bottlenecks, if any. Also, we should not include the RDBMS at the start of the testing phase. Last but not least, we should understand and assess the available tuning knobs.
What is difference between Join, Merge and Lookup stage?
All three stages differ in the way they use memory, in the input requirements they impose, and in how they treat various records. Join and Merge need less memory as compared to the Lookup stage.
What is Quality stage?
Quality stage is also known as Integrity stage. It assists in integrating different types of data from various sources.
What is the sortmerge collector?
The sortmerge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.
What is aggtorec restructure operator?
The aggtorec restructure operator groups records that have the same key-field values into an output record.
What is Job control?
Job control can be best performed by using Job Control Language (JCL). This tool is used to execute multiple jobs simultaneously, without using any kind of loop.
What is difference between Symmetric Multiprocessing and Massive Parallel Processing?
In Symmetric Multiprocessing, the hardware resources are shared by the processors, which run a single operating system and communicate through shared memory. In Massive Parallel Processing, each processor accesses its hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared. It scales better than Symmetric Multiprocessing for very large data volumes.
What are the steps required to kill the job in Datastage?
To kill a job in Datastage, we have to kill the respective processing ID.
Can you explain Kafka connector?
Kafka connector has been enhanced with the following new capabilities:
Continuous mode, where incoming topic messages are consumed without stopping the connector.
Transactions, where a number of Kafka messages are fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.
TLS connection to Kafka.
Kerberos keytab locality is supported.
What is the Project in Datastage?
Whenever we launch the Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components and user-defined components, and is worked on through the Datastage Designer.
What are the features of DataStage Flow Designer?
Flow Designer Features
IBM DataStage Flow Designer has many features to enhance your job building experience.
We can use the palette to drag and drop connectors and operators onto the designer canvas.
We can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes.
We can edit stage properties in the side bar and make changes to the schema in the Column Properties tab.
We can zoom in and out using the mouse, and leverage the mini-map on the lower right of the window to focus on a particular part of the DataStage job.
This is very useful when you have a very large job with tens or hundreds of stages.
How many types of hash files are there?
There are two types of hash files in DataStage, i.e. the Static hash file and the Dynamic hash file. The static hash file is used when a limited amount of data is to be loaded into the target database. The dynamic hash file is used when we don't know the amount of data coming from the source file.
How do you import and export data into Datastage?
Data is imported into and exported from Datastage using the import/export utility, which consists of two operators. The import operator imports one or more data files into a single data set. The export operator exports a data set to one or more data files.
Can you explain tagbatch restructure operator?
The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.
Can you explain Engine tier in Information server?
The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.
What is Meta Stage?
In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis.
Have you ever worked in a UNIX environment, and why is it useful in Datastage?
Yes, I have worked in UNIX environment. This knowledge is useful in Datastage because sometimes one has to write UNIX programs such as batch programs to invoke batch processing etc.
What is the difference between Datastage and Datastage TX?
Datastage is a tool from ETL (Extract, Transform and Load) and Datastage TX is a tool from EAI (Enterprise Application Integration).
What do transaction size and array size mean in Datastage?
Transaction size means the number of rows written before committing the records to a table. Array size means the number of rows written to or read from the table at a time.
How many types of views are there in a Datastage Director?
There are three types of views in a Datastage Director i.e. Job View, Log View and Status View.
Can you explain Link buffering?
InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).
Why we use surrogate key?
In Datastage, we use a surrogate key instead of a unique key. A surrogate key is mostly used for retrieving data faster, as it uses an index to perform the retrieval operation.
How are rejected rows managed in Datastage?
In Datastage, rejected rows are managed through constraints in the Transformer stage. We can either direct the rejected rows through the properties of the transformer or create temporary storage for rejected rows with the help of the REJECTED command.
Can you explain Players in Datastage?
Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
What is the difference between ODBC and DRS stage?
DRS stage is faster than the ODBC stage because it uses native databases for connectivity.
What is a DataStage job?
A Datastage job is simply DataStage code that we create as developers. It contains different stages linked together to define the data and process flow. Stages are nothing but the functionalities that get implemented.
Why do we use Link Partitioner and Link Collector in Datastage?
In Datastage, the Link Partitioner is used to divide data into different parts using certain partitioning methods. The Link Collector is used to gather data from the various partitions/segments into a single stream and save it in the target table.
What is the difference between Orabulk and BCP stages?
The Orabulk stage is used to load a large amount of data into one target table of an Oracle database. The BCP stage is used to load a large amount of data into one target table of Microsoft SQL Server.
What is DS Designer?
The DS Designer is used to design the work area and add various links to it.
What is the Roundrobin collector?
The roundrobin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.