What is DataStage?
Datastage is an ETL (Extract, Transform and Load) tool that is part of the IBM Infosphere suite. It is used to build and maintain large data warehouses and data marts: with it we design, develop and execute jobs that fill the tables of a data warehouse or data mart. The engine runs on Windows and UNIX servers, extracts data from source databases and transforms it for loading into the warehouse. It has become an essential part of the IBM WebSphere Data Integration suite.
What is Data partitioning?
Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance. When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at runtime, InfoSphere DataStage applies that algorithm across the degree of parallelism, which is specified dynamically at run time through the configuration file.
What are the important features of Datastage?
Datastage is used to perform the ETL operations (Extract, transform, load)
Datastage is the data integration component of IBM Infosphere information server.
Datastage is a GUI-based tool. We just need to drag and drop the Datastage objects, and the design is converted into Datastage code.
Datastage provides connectivity to multiple sources & multiple targets at the same time
It provides partitioning and parallel processing techniques that enable Datastage jobs to process huge volumes of data much faster.
It has enterprise-level connectivity.
How is a source file populated?
We can populate a source file in many ways, such as by creating a SQL query in Oracle or by using the Row Generator stage, etc.
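For example, a simple query along these lines (the table and column names are placeholders) could be run in Oracle and spooled to a flat file to serve as the source:
SELECT customer_id, customer_name, created_date FROM customers;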
What is usage analysis in DataStage?
If you want to know whether a certain job is part of a sequence, right-click the job in the Manager and then choose Usage Analysis.
What is the command line function to import and export the DS jobs?
To import the DS jobs, dsimport.exe is used and to export the DS jobs, dsexport.exe is used.
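Illustrative invocations are shown below; the domain, host, user, project and file names are placeholders and the exact options vary between versions, so check the documentation of your release:
dsimport.exe /D=domain:9080 /H=enginehost /U=dsadm /P=password MyProject C:\exports\MyJob.dsx
dsexport.exe /D=domain:9080 /H=enginehost /U=dsadm /P=password /JOB=MyJob MyProject C:\exports\MyJob.dsx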
What is the difference between Datastage 7.5 and 7.0?
In Datastage 7.5 many new stages were added for more robustness and smoother performance, such as the Procedure stage, the Command stage, Generate Report, etc.
What are some differences between the 7.x and 8.x versions of DataStage?
Version of 7.X:
- It is platform dependent
- It has a 2-tier architecture where Datastage is built on top of a UNIX server
- There is no concept of a parameter set
- Designer and Manager were two separate clients
- We had to manually search for jobs in this version
Version of 8.X:
- It is platform independent
- It has a 3-tier architecture: the UNIX server database at the bottom, then the XMETA database which acts as the repository, and then Datastage on top
- We have parameter sets which can be used anywhere in the project
- In this version, the Manager client was merged into the Designer client
- We have a quick find option in the repository where we can easily search for jobs
How can you fix the truncated data error in Datastage?
The truncated data error can be fixed by using the environment variable IMPORT_REJECT_STRING_FIELD_OVERRUN.
Can you define Merge?
Merge means to join two or more tables. The tables are joined on the basis of the primary key columns in both tables.
What is the difference between a data file and a descriptor file?
As the names imply, data files contain the data, and the descriptor file contains the description/information about the data in the data files.
What is the difference between an Operational DataStage and a Data Warehouse?
An Operational DataStage can be considered a staging area for real-time analysis and user processing; thus it is a temporary repository, whereas the data warehouse is used for long-term data storage and holds the complete data of the entire business.
What is the difference between validated and Compiled in the Datastage?
In Datastage, validating a job means executing it in a validation run: while validating, the Datastage engine verifies whether all the required properties have been provided. While compiling a job, on the other hand, the engine verifies whether the given properties are valid.
How to manage date conversion in Datastage?
We can use the date conversion functions for this purpose, i.e. Oconv(Iconv(FieldName, "Existing Date Format"), "Another Date Format").
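For instance, a sketch that converts a YYYY/MM/DD string into DD-MON-YYYY (the conversion codes are illustrative and must match the actual source format):
Oconv(Iconv(InputDate, "D/YMD[4,2,2]"), "D-DMY[2,A3,4]")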
Why do we use exception activity in Datastage?
All the stages after the Exception activity in Datastage are executed when any unknown error occurs while the job sequencer is running.
What is APT_CONFIG in Datastage?
It is the environment variable that is used to identify the *.apt configuration file in Datastage. The configuration file stores the node information, disk storage information and scratch space information.
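A minimal configuration file, as a rough sketch (the node name, host name and directory paths are placeholders), looks like this:
{
  node "node1" {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" { pools "" }
    resource scratchdisk "/data/scratch" { pools "" }
  }
}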
What are the different types of Lookups in Datastage?
There are two types of Lookups in Datastage, i.e. Normal lkp and Sparse lkp. In a Normal lkp, the reference data is loaded into memory first and then the lookup is performed. In a Sparse lkp, the lookup query is fired directly against the database for each input row. A Normal lkp is therefore usually faster; a Sparse lkp pays off only when the reference table is much larger than the stream of input rows.
How can a server job be converted to a parallel job?
We can convert a server job into a parallel job by using the IPC stage and the Link Collector stage.
Can you explain Repository tables in Datastage?
In Datastage, the Repository is another name for a data warehouse. It can be centralized as well as distributed.
What are Routines?
Routines are basically collections of functions that are defined in the DS Manager and can be called from a Transformer stage. There are three types of routines: parallel routines, mainframe routines and server routines.
How can you write parallel routines in Datastage PX?
We can write parallel routines in C or C++ and compile them into an object or shared library. Such routines are also defined in the DS Manager and can be called from a Transformer stage.
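As a rough sketch of what such a routine can look like (the function name and argument list are assumptions; the code is compiled into a shared library and then registered in the Designer as an External Function parallel routine):
extern "C" {
    /* Returns the length of the input string, treating a NULL pointer as an
       empty string. Real routines use whatever argument and return types are
       declared for them when the routine is registered. */
    int string_length(char *value)
    {
        int len = 0;
        if (value != 0) {
            while (value[len] != '\0') {
                ++len;
            }
        }
        return len;
    }
}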
What is the method of removing duplicates, without the remove duplicate stage?
Duplicates can be removed by using the Sort stage, where we can set the option Allow Duplicates = False.
What are the different options associated with dsjob command?
The dsjob command (run as, for example, $ dsjob -run) supports options such as the following; a usage example follows the list:
- run: To run a job
- stop: To stop a running job
- lprojects: To list the projects
- ljobs: To list the jobs in a project
- lstages: To list the stages present in a job
- llinks: To list the links
- projectinfo: Returns the project information (host name and project name)
- jobinfo: Returns the job information (job status, job run time, end time, etc.)
- stageinfo: Returns the stage information (stage name, stage type, input rows, etc.)
- linkinfo: Returns the link information
- lparams: To list the parameters in a job
- paraminfo: Returns the parameter information
- log: To add a text message to the log
- logsum: To display the log
- logdetail: To display the log with details such as event id, time and message
- lognewest: To display the newest log id
- report: To display a report containing the generated time, start time, elapsed time, status, etc.
- jobid: Returns the job id information
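For example (the project and job names are placeholders):
$ dsjob -run -jobstatus MyProject MyJob
$ dsjob -logsum MyProject MyJob
The first command runs the job and waits so that its exit code reflects the finishing status; the second displays a summary of the job log.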
What steps should be taken to improve Datastage jobs?
In order to improve the performance of Datastage jobs, we have to first establish baselines. Secondly, we should not use only one flow for performance testing. Thirdly, we should work in increments. Then, we should evaluate data skew. Then we should isolate and solve the problems, one by one. After that, we should distribute the file systems to remove bottlenecks, if any. Also, we should not include the RDBMS at the start of the testing phase. Last but not least, we should understand and assess the available tuning knobs.
What is difference between Join, Merge and Lookup stage?
All three stages differ in the way they use memory, in the input requirements they impose, and in how they treat various records. Join and Merge need less memory as compared to the Lookup stage.
What is Quality stage?
Quality stage is also known as Integrity stage. It assists in integrating different types of data from various sources.
What is the sortmerge collector?
The sortmerge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.
What is aggtorec restructure operator?
The aggtorec restructure operator groups records that have the same key-field values into an output record.
What is Job control?
Job control can be best performed by using Job Control Language (JCL). This tool is used to execute multiple jobs simultaneously, without using any kind of loop.
What is difference between Symmetric Multiprocessing and Massive Parallel Processing?
In Symmetric Multiprocessing, the hardware resources are shared by the processors, which run a single operating system and communicate through shared memory. In Massive Parallel Processing, each processor accesses its hardware resources exclusively. This type of processing is also known as Shared Nothing, since nothing is shared. It scales better than Symmetric Multiprocessing for very large data volumes.
What are the steps required to kill the job in Datastage?
To kill a job in Datastage, we have to kill the respective processing ID.
Can you explain Kafka connector?
Kafka connector has been enhanced with the following new capabilities:
Continuous mode, where incoming topic messages are consumed without stopping the connector.
Transactions, where a number of Kafka messages are fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.
TLS connection to Kafka.
Kerberos keytab locality is supported.
What is the Project in Datastage?
Whenever we launch the Datastage client, we are asked to connect to a Datastage project. A Datastage project contains Datastage jobs, built-in components and user-defined components, and is worked on through the Datastage Designer.
What are the features of DataStage Flow Designer?
Flow Designer Features
IBM DataStage Flow Designer has many features to enhance your job building experience.
We can use the palette to drag and drop connectors and operators onto the designer canvas.
We can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes.
We can edit stage properties in the side bar and make changes to the schema in the Column Properties tab.
We can zoom in and out using the mouse, and leverage the mini-map on the lower right of the window to focus on a particular part of the DataStage job.
This is very useful when you have a very large job with tens or hundreds of stages.
How many types of hash files are there?
There are two types of hash files in DataStage, i.e. the Static hash file and the Dynamic hash file. The static hash file is used when a limited amount of data is to be loaded into the target database. The dynamic hash file is used when we don't know the amount of data coming from the source file.
How do you import and export data into Datastage?
Data is imported into and exported from Datastage using the import/export utility, which consists of two operators. The import operator imports one or more data files into a single data set. The export operator exports a data set to one or more data files.
Can you explain tagbatch restructure operator?
The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.
Can you explain Engine tier in Information server?
The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.
What is Meta Stage?
In Datastage, MetaStage is used to save metadata that is helpful for data lineage and data analysis.
Have you ever worked in a UNIX environment, and why is it useful in Datastage?
Yes, I have worked in UNIX environment. This knowledge is useful in Datastage because sometimes one has to write UNIX programs such as batch programs to invoke batch processing etc.
What is the difference between Datastage and Datastage TX?
Datastage is a tool from ETL (Extract, Transform and Load) and Datastage TX is a tool from EAI (Enterprise Application Integration).
What do transaction size and array size mean in Datastage?
Transaction size means the number of rows written before committing the records to a table. Array size means the number of rows written to or read from the table at a time.
How many types of views are there in a Datastage Director?
There are three types of views in a Datastage Director i.e. Job View, Log View and Status View.
Can you explain Link buffering?
InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).
Why we use surrogate key?
In Datastage, we use a surrogate key instead of a unique key. A surrogate key is mostly used for retrieving data faster, as it uses an index to perform the retrieval operation.
How are rejected rows managed in Datastage?
In Datastage, rejected rows are managed through constraints in the Transformer stage. We can either direct the rejected rows through the properties of the transformer or create temporary storage for rejected rows with the help of the REJECTED command.
Can you explain Players in Datastage?
Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
What is the difference between ODBC and DRS stage?
DRS stage is faster than the ODBC stage because it uses native databases for connectivity.
What is a DataStage job?
A Datastage job is simply DataStage code that we create as developers. It contains different stages linked together to define the data and process flow. Stages are nothing but the functionalities that get implemented.
Why do we use Link Partitioner and Link Collector in Datastage?
In Datastage, the Link Partitioner is used to divide data into different parts using certain partitioning methods. The Link Collector is used to gather data from the various partitions/segments into a single stream and save it in the target table.
What is the difference between Orabulk and BCP stages?
The Orabulk stage is used to load a large amount of data into one target table of an Oracle database. The BCP stage is used to load a large amount of data into one target table of Microsoft SQL Server.
What is DS Designer?
The DS Designer is used to design the work area and add various links to it.
What is the Roundrobin collector?
The roundrobin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.