What is Greenplum?
Greenplum Database stores and processes massive amounts of information by distributing the load across many servers or hosts. A logical database in Greenplum is an array of individual PostgreSQL databases working together to present a single database image. The master is the entry point to the Greenplum Database system: it is the database instance to which users connect and submit SQL statements.
Explain how data is stored in Greenplum?
Data is stored based on the selected field(s) used for distribution. When you have a distribution key, the values of the distribution key are run through a hash formula, and a map is then used to distribute each row to the right segment. The formula is designed to be consistent, so that all like values go to the same segment.
Data (A) → Hash Function (B) → Logical Segment List (C) → Physical Segment List (D) → Storage (E)
When data arrives at Greenplum, it is hashed on the chosen field(s); the hash function (B) is used for this purpose.
For example, consider a 4-node system: the logical segment list (C) has 4 unique entries. If there are 10 hashed data items from (B), there are 10 entries in (C), all drawn from only those 4 segment values; for example, (C) might contain [4, 1, 2, 3, 4, 3, 1, 4, 3, 2]. The map then distributes each row to the correct segment, and because the formula is consistent, all like values go to the same segment.
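The key-to-segment mapping above can be sketched in a few lines of Python. This is an illustrative stand-in only: Greenplum's real hash formula is internal, and md5 is used here just to get a deterministic hash.

```python
import hashlib

NUM_SEGMENTS = 4  # example cluster, matching the 4-node system above

def segment_for(key: str) -> int:
    # Deterministic stand-in for Greenplum's internal hash formula:
    # equal distribution-key values always map to the same segment.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SEGMENTS

# Ten hashed rows yield ten entries in the logical segment list (C),
# each referencing one of the 4 segments.
keys = [f"cust_{i % 5}" for i in range(10)]
logical_segment_list = [segment_for(k) for k in keys]
```

Because the mapping depends only on the key, re-hashing the same value always routes to the same segment, which is what lets joins on the distribution key run locally on each segment.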
Which environment variables are used to connect to Greenplum?
Required environment variables: GPHOME, PATH, LD_LIBRARY_PATH, MASTER_DATA_DIRECTORY.
Optional environment variables: PGAPPNAME, PGHOST, PGUSER, PGPASSWORD, PGDATABASE, PGHOSTADDR, PGPASSFILE, PGDATESTYLE, PGTZ, PGCLIENTENCODING, PGPORT.
Can you explain Greenplum Database?
Greenplum Database is an advanced, open source, fully featured data platform. It provides powerful and fast analytics on petabyte-scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world’s most advanced cost-based query optimizer, delivering high analytical query performance on massive data volumes. The Greenplum Database® project is released under the Apache 2 license.
Greenplum Database is an MPP database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads.
MPP (Massively Parallel Processing, also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system, and disks. Greenplum uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and can use all of a system’s resources in parallel to process a query.
This Database is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS). It was originally based on PostgreSQL 8.2.15, and in many cases is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality. Database users interact with Greenplum Database as they would a regular PostgreSQL DBMS.
What’s new in Pivotal Greenplum 5.2.0?
Pivotal Greenplum 5.2.0 includes these new features:
- Partitioned Table Enhancement for External Tables
- GPORCA Partition Elimination Enhancement
- analyzedb Utility Enhancement
- Resource Groups
- Greenplum Platform Extension Framework (PXF)
- Dell EMC DCA Support
- passwordcheck Module
How do you connect to a Greenplum system without a password prompt?
Assign values to the Greenplum environment variables and export them in the bash shell. You can place the statements below in the .bashrc file in the user’s $HOME directory, so the values are set every time you log in to the system.
- export PGHOST=<master_host_name_or_IP>
- export PGPORT=<port>
- export PGUSER=<username>
- export PGPASSWORD=<password>
- export PGDATABASE=<database>
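As a sketch, client code can assemble a libpq-style connection string from these variables. The host/user values below are hypothetical stand-ins for the exports above; PGPASSWORD is read by the client library itself, so the password prompt is skipped without the password ever appearing in the string.

```python
import os

# Hypothetical values standing in for the .bashrc exports above.
os.environ["PGHOST"] = "mdw"
os.environ["PGPORT"] = "5432"
os.environ["PGUSER"] = "gpadmin"
os.environ["PGDATABASE"] = "gpdb"

def dsn_from_env() -> str:
    # Build the connection string from the standard PG* variables;
    # PGPASSWORD is deliberately left out, since libpq-style clients
    # pick it up from the environment on their own.
    return ("host={PGHOST} port={PGPORT} "
            "user={PGUSER} dbname={PGDATABASE}").format_map(os.environ)
```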
How do you add a new user to the database?
Use the createuser utility to create users.
You can also use SQL commands in psql prompt to create users.
For example: CREATE USER or CREATE ROLE ….
What is Greenplum DCA?
The Greenplum DCA is a self-contained data warehouse solution that integrates all of the database software, servers, and switches necessary to perform big data analytics. The DCA is a turn-key, easy-to-install data warehouse solution that provides extreme query and loading performance for analyzing massive data sets. The DCA integrates Greenplum Database, data loading, and Hadoop software with compute, storage, and network components, delivered racked and ready for immediate data loading and query execution.
What is EMC DCA?
The EMC Data Computing Appliance (DCA) is a unified data analytics appliance: a modular, flexible platform purpose-built as a fully integrated appliance for Pivotal Greenplum, the world’s first open source massively parallel data warehouse. It accelerates data analyses within a single appliance, delivering faster time to value with lower integration risk and total cost. With the EMC DCA, your organization can maximize security, availability, and performance for Greenplum without the complexity and constraints of proprietary hardware.
What are the Greenplum DCA Modules?
The Greenplum DCA is built from modules: GPDB, DIA and GP HD.
GPDB: These modules are a block of servers that host the Greenplum Database. GPDB is always the first module in a DCA. Two additional servers, called masters, are included with the first GPDB module.
DIA: These modules are high-capacity loading servers. The DIA servers are pre-configured with Greenplum’s gpfdist and gpload software for easy loading of data into the GPDB modules.
GP HD: These modules are configured with Greenplum’s Hadoop distribution and ready for high-performance unstructured data queries.
What kind of locks should we focus on in an MPP system when the system is slow or hung?
Locks that are held for a very long time and that many other queries are waiting on.
Can you explain pg_dump and gp_dump?
pg_dump: Non-parallel backup utility; you need a big file system, as the backup is created on the master node only.
gp_dump: Parallel backup utility; the backup is created on the master and segment file systems.
Can you define gp_toolkit?
It is a database schema with many tables, views, and functions that help you manage Greenplum Database better while the database is up. In versions earlier than 3.x it was referred to as gp_jetpack.
Can you define gpdetective, and how do I run it in Greenplum?
The gpdetective utility collects information from a running Greenplum Database system and creates a bzip2-compressed tar output file. This output file helps with the diagnosis of Greenplum Database errors or system failures. For more details, check the help:
gpdetective --help
Can you explain primary-to-mirror mapping?
The following catalog query lists the configuration by content ID, from which you can work out the primary and mirror for each content:
gpdb=# select * from gp_configuration order by content;
Note: starting from GPDB 4.x, gp_segment_configuration table is used instead.
gpdb=# select * from gp_segment_configuration order by dbid;
Can you define segment in Greenplum?
Database instances in the Greenplum system are called segments. Segments store the data and carry out query processing. In Greenplum’s distributed system, each segment contains a distinct portion of the data.
Explain the advantages of Greenplum compared to Oracle?
Greenplum has a shared-nothing MPP architecture, which is best for data warehousing environments. It is built on top of PostgreSQL and is well suited to big data analytics. Oracle is an all-purpose database.
Explain VACUUM and VACUUM FULL?
Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL.
VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance as it is counterproductive.
Which one is better, DELETE or TRUNCATE?
TRUNCATE is better. If you use DELETE, Greenplum does not physically remove the data; instead it logically removes each row by flagging its xmax with the deleting transaction’s XID.
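A toy model of that behaviour (class and field names are hypothetical; real row headers carry more state than this):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    value: str
    xmin: int                   # XID of the inserting transaction
    xmax: Optional[int] = None  # XID of the deleting transaction (None = live)

class Table:
    def __init__(self):
        self.rows = []

    def insert(self, value, xid):
        self.rows.append(Row(value, xmin=xid))

    def delete_all(self, xid):
        # DELETE: rows stay on disk; they are only flagged dead via xmax
        # and reclaimed later by VACUUM.
        for row in self.rows:
            row.xmax = xid

    def truncate(self):
        # TRUNCATE: the storage is dropped outright.
        self.rows = []

    def visible(self):
        return [r.value for r in self.rows if r.xmax is None]
```

After delete_all, visible() is empty but the rows still occupy space; after truncate, they are gone. That is why TRUNCATE is the cheaper way to empty a table.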
Can you explain the difference between Oracle and Greenplum?
Oracle is a general-purpose relational database. Greenplum is MPP in nature with a shared-nothing architecture. There are many other differences in functionality and behaviour.
What is the Greenplum performance monitor, and how do you install it?
It is a monitoring tool that collects statistics on system and query performance and builds up historical data.
Can you explain how to find errors/fatals in the log files?
grep for ERROR, FATAL, and SIGSEGV in the pg_log directory.
Can you define resource queues?
Resource queues are used to manage Greenplum Database workload. All users/queries can be prioritized using resource queues.
Can you explain the tools available in Greenplum to take backup and restores?
non-parallel backups:
Use the postgres utilities (pg_dump and pg_dumpall for backup, and pg_restore for restore).
Another useful command for getting data out of the database is COPY <table> TO <file>.
parallel backups:
gp_dump and gpcrondump for backups and gp_restore for restore process.
Can you explain the process of data migration from Oracle to Greenplum?
There are many ways. The simplest steps are: unload the data into CSV files, create tables in the Greenplum database corresponding to the Oracle tables, create an external table, start gpfdist pointing at the external table location, and load the data into Greenplum. You can also use the gpload utility, which creates the external table at runtime.
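The external-table step can be sketched by generating the DDL for a gpfdist-backed readable external table. The table, column, and host names below are hypothetical placeholders, not values from the original text:

```python
def external_table_ddl(table: str, columns: dict, gpfdist_url: str) -> str:
    # Builds the CREATE EXTERNAL TABLE statement used in the gpfdist
    # loading step; 'columns' maps column name -> Greenplum type.
    cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE ext_{table} ({cols}) "
        f"LOCATION ('{gpfdist_url}') "
        "FORMAT 'CSV' (HEADER);"
    )

ddl = external_table_ddl(
    "customers",                              # hypothetical table
    {"id": "int", "name": "text"},            # hypothetical columns
    "gpfdist://etl_host:8081/customers.csv",  # hypothetical gpfdist URL
)
```

Once gpfdist is serving the CSV file, the load itself is then something like INSERT INTO customers SELECT * FROM ext_customers;.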
Which command would you use to back up a database?
gp_dump, gpcrondump, pg_dump, pg_dumpall, copy
Can you explain delete/drop an existing database in greenplum?
gpdb=# \h DROP DATABASE
Command: DROP DATABASE
Description: remove a database
Syntax: DROP DATABASE [ IF EXISTS ] name
Also check the dropdb utility: $GPHOME/bin/dropdb --help
dropdb removes a PostgreSQL database.
Usage:
dropdb [OPTION]… DBNAME
Can you explain how to run gpcheckcat?
The gpcheckcat tool is used to check catalog inconsistencies between master and segments. It can be found in the
$GPHOME/bin/lib directory:
Usage: gpcheckcat <option> [dbname]
-B parallel : number of worker threads
-g dir : generate SQL to rectify catalog corruption, put it in dir
-h host : DB host name
-p port : DB port number
-P passwd : DB password
-o : check OID consistency
-U uname : DB user name
-v : verbose
Example: gpcheckcat gpdb >gpcheckcat_gpdb_logfile.log
Can you explain starting/stopping the database in admin and utility mode?
Admin mode: gpstart with the -R option stands for admin (restricted) mode, where only superusers can connect to the database when it is opened with this option.
Utility mode: utility mode allows you to connect to individual segments only, when started using gpstart -m. For example, to connect to the master instance only:
PGOPTIONS='-c gp_session_role=utility' psql
How can you see the value of a GUC?
Connect to the GPDB database using psql and query the catalog, or use SHOW.
Example: gpdb=# select name, setting from pg_settings where name = '<GUC_name>';
Or: gpdb=# show <GUC_name>;
How do I clone my production database to a PreProd or QA environment?
If Prod and QA are on the same GPDB cluster, use CREATE DATABASE <Clone_DBname> TEMPLATE <Source_DB>;
If Prod and QA are on different clusters, use the backup and restore utilities.
Why is ETL important for Greenplum?
As a data warehouse product, Greenplum can process huge data sets, often at the petabyte level, but it cannot generate that volume of data by itself. Data is usually generated by many users or embedded devices. Ideally, every data source would populate Greenplum directly, but in practice this is not feasible: data is the core asset of a company, and Greenplum is just one of many tools used to produce value from that asset. One common answer is to use an intermediate system to store all the data.
When Greenplum loads data from the intermediate system, the most important task becomes loading the data effectively enough. In some cases, users need newly generated data to be available within a limited delay. This is one of the most important tasks of ETL tools. In short, ETL tools help Greenplum load data from external sources reliably and effectively.
Can you explain the Greenplum interconnect?
The interconnect is the networking layer of the Greenplum Database. When a user connects to a database and issues a query, processes are created on each of the segments to handle the work of that query. The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication depends. The interconnect uses a standard 10 Gigabit Ethernet switch fabric.