What is Greenplum?
Greenplum Database stores and processes massive amounts of information by distributing the load across many servers or hosts. A logical database in Greenplum is an array of individual PostgreSQL databases working together to present a single database image. The master is the entry point to the Greenplum Database system: it is the database instance to which users connect and submit SQL statements.
Explain how data is stored in Greenplum?
Data is stored based on the selected field(s) used for distribution. When you have a distribution key, the values of the distribution key are run through a hash formula, and a map is then used to distribute each row to the right segment. The formula is designed to be consistent, so that all like values go to the same segment.
Data (A) → Hash Function (B) → Logical Segment List (C) → Physical Segment List (D) → Storage (E)
When data arrives at Greenplum, it is hashed on the chosen field(s); the hash function (B) is used for this purpose.
For example, consider a 4-node system: the logical segment list (C) has 4 unique entries. If there are 10 hashed data items from (B), there are 10 entries in (C), all drawn from only those 4 segment values; for example, (C) might contain [4, 1, 2, 3, 4, 3, 1, 4, 3, 2]. The map then distributes each row to the correct segment, and because the formula is consistent, all like values go to the same segment.
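The key-to-segment mapping above can be sketched in a few lines of Python. This is an illustrative stand-in only: Greenplum's real hash formula is internal, and md5 is used here just to get a deterministic hash.

```python
import hashlib

NUM_SEGMENTS = 4  # example cluster, matching the 4-node system above

def segment_for(key: str) -> int:
    # Deterministic stand-in for Greenplum's internal hash formula:
    # equal distribution-key values always map to the same segment.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SEGMENTS

# Ten hashed rows yield ten entries in the logical segment list (C),
# each referencing one of the 4 segments.
keys = [f"cust_{i % 5}" for i in range(10)]
logical_segment_list = [segment_for(k) for k in keys]
```

Because the mapping depends only on the key, re-hashing the same value always routes to the same segment, which is what lets joins on the distribution key run locally on each segment.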
Which environment variables are used to connect to Greenplum?
Required environment variables: GPHOME, PATH, LD_LIBRARY_PATH, MASTER_DATA_DIRECTORY.
Optional environment variables: PGAPPNAME, PGHOST, PGUSER, PGPASSWORD, PGDATABASE, PGHOSTADDR, PGPASSFILE, PGDATESTYLE, PGTZ, PGCLIENTENCODING, PGPORT.
Can you explain Greenplum Database?
Greenplum Database is an advanced, open source, fully featured data platform. It provides powerful and fast analytics on petabyte-scale data volumes. Uniquely geared toward big data analytics, Greenplum Database is powered by the world’s most advanced cost-based query optimizer, delivering high analytical query performance on massive data volumes. The Greenplum Database® project is released under the Apache 2 license.
Greenplum Database is an MPP database server with an architecture specially designed to manage large-scale analytic data warehouses and business intelligence workloads.
MPP (Massively Parallel Processing, also known as a shared-nothing architecture) refers to systems with two or more processors that cooperate to carry out an operation, each processor with its own memory, operating system, and disks. Greenplum uses this high-performance system architecture to distribute the load of multi-terabyte data warehouses and can use all of a system’s resources in parallel to process a query.
This Database is based on PostgreSQL open-source technology. It is essentially several PostgreSQL database instances acting together as one cohesive database management system (DBMS). It was originally based on PostgreSQL 8.2.15, and in many cases is very similar to PostgreSQL with regard to SQL support, features, configuration options, and end-user functionality. Database users interact with Greenplum Database as they would a regular PostgreSQL DBMS.
What’s new in Pivotal Greenplum 5.2.0?
Pivotal Greenplum 5.2.0 includes these new features:
- Partitioned Table Enhancement for External Tables
- GPORCA Partition Elimination Enhancement
- analyzedb Utility Enhancement
- Resource Groups
- Greenplum Platform Extension Framework (PXF)
- Dell EMC DCA Support
- passwordcheck Module
How do you connect to a Greenplum system without a password prompt?
Assign values to the Greenplum environment variables and export them in the bash shell. You can place the statements below in the .bashrc file in the user’s $HOME directory, so the values are set every time you log in to the system.
- export PGHOST=<master_host_name_or_IP>
- export PGPORT=<port>
- export PGUSER=<username>
- export PGPASSWORD=<password>
- export PGDATABASE=<database>
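As a sketch, client code can assemble a libpq-style connection string from these variables. The host/user values below are hypothetical stand-ins for the exports above; PGPASSWORD is read by the client library itself, so the password prompt is skipped without the password ever appearing in the string.

```python
import os

# Hypothetical values standing in for the .bashrc exports above.
os.environ["PGHOST"] = "mdw"
os.environ["PGPORT"] = "5432"
os.environ["PGUSER"] = "gpadmin"
os.environ["PGDATABASE"] = "gpdb"

def dsn_from_env() -> str:
    # Build the connection string from the standard PG* variables;
    # PGPASSWORD is deliberately left out, since libpq-style clients
    # pick it up from the environment on their own.
    return ("host={PGHOST} port={PGPORT} "
            "user={PGUSER} dbname={PGDATABASE}").format_map(os.environ)
```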
How do you add a new user to the database?
Use the createuser utility to create users.
You can also use SQL commands in psql prompt to create users.
For example: CREATE USER or CREATE ROLE ….
What is Greenplum DCA?
The Greenplum DCA is a self-contained data warehouse solution that integrates all of the database software, servers, and switches necessary to perform big data analytics. The DCA is a turn-key, easy-to-install data warehouse solution that provides extreme query and loading performance for analyzing massive data sets. The DCA integrates Greenplum Database, data loading, and Hadoop software with compute, storage, and network components, delivered racked and ready for immediate data loading and query execution.
What is EMC DCA?
The EMC Data Computing Appliance (DCA) is a unified data analytics appliance: a modular, flexible platform purpose-built as a fully integrated appliance for Pivotal Greenplum, the world’s first open source massively parallel data warehouse. It accelerates data analyses within a single appliance, delivering faster time to value with lower integration risk and total cost. With the EMC DCA, your organization can maximize security, availability, and performance for Greenplum without the complexity and constraints of proprietary hardware.
What are the Greenplum DCA Modules?
The Greenplum DCA is built from modules: GPDB, DIA and GP HD.
GPDB: These modules are a block of servers that host the Greenplum Database. GPDB is always the first module in a DCA. Two additional servers, called masters, are included with the first GPDB module.
DIA: These modules are high-capacity loading servers. The DIA servers are pre-configured with Greenplum’s gpfdist and gpload software for easy loading of data into the GPDB modules.
GP HD: These modules are configured with Greenplum’s Hadoop distribution and ready for high-performance unstructured data queries.
What kind of locks should we focus on in an MPP system when the system is slow or hung?
Locks that are held for a very long time and that many other queries are waiting on.
Can you explain pg_dump and gp_dump?
pg_dump: Non-parallel backup utility; you need a big file system, as the backup is created on the master node only.
gp_dump: Parallel backup utility; the backup is created on the master and segment file systems.
Can you define gp_toolkit?
It is a database schema with many tables, views, and functions that help you manage Greenplum Database better while the database is up. In versions earlier than 3.x it was referred to as gp_jetpack.
Can you define gpdetective, and how do I run it in Greenplum?
The gpdetective utility collects information from a running Greenplum Database system and creates a bzip2-compressed tar output file. This output file helps with the diagnosis of Greenplum Database errors or system failures. For more details, check the help:
gpdetective --help
Can you explain primary-to-mirror mapping?
The following catalog query lists the configuration by content ID, from which you can work out the primary and mirror for each content:
gpdb=# select * from gp_configuration order by content;
Note: starting from GPDB 4.x, gp_segment_configuration table is used instead.
gpdb=# select * from gp_segment_configuration order by dbid;
Can you define segment in Greenplum?
Database instances in the Greenplum system are called segments. Segments store the data and carry out query processing. In Greenplum’s distributed system, each segment contains a distinct portion of the data.
Explain the advantages of Greenplum compared to Oracle?
Greenplum has a shared-nothing MPP architecture, which is best for data warehousing environments. It is built on top of PostgreSQL and is well suited to big data analytics. Oracle is an all-purpose database.
Explain VACUUM and VACUUM FULL?
Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL.
VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance as it is counterproductive.
Which one is better, DELETE or TRUNCATE?
TRUNCATE is better. If you use DELETE, Greenplum does not physically remove the data; instead it logically removes each row by flagging its xmax with the deleting transaction’s XID.
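A toy model of that behaviour (class and field names are hypothetical; real row headers carry more state than this):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Row:
    value: str
    xmin: int                   # XID of the inserting transaction
    xmax: Optional[int] = None  # XID of the deleting transaction (None = live)

class Table:
    def __init__(self):
        self.rows = []

    def insert(self, value, xid):
        self.rows.append(Row(value, xmin=xid))

    def delete_all(self, xid):
        # DELETE: rows stay on disk; they are only flagged dead via xmax
        # and reclaimed later by VACUUM.
        for row in self.rows:
            row.xmax = xid

    def truncate(self):
        # TRUNCATE: the storage is dropped outright.
        self.rows = []

    def visible(self):
        return [r.value for r in self.rows if r.xmax is None]
```

After delete_all, visible() is empty but the rows still occupy space; after truncate, they are gone. That is why TRUNCATE is the cheaper way to empty a table.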
Can you explain the difference between Oracle and Greenplum?
Oracle is a general-purpose relational database. Greenplum is MPP in nature with a shared-nothing architecture. There are many other differences in functionality and behaviour.
What is the Greenplum performance monitor, and how do you install it?
It is a monitoring tool that collects statistics on system and query performance and builds up historical data.
Can you explain how to find errors/fatals in the log files?
grep for ERROR, FATAL, and SIGSEGV in the pg_log directory.
Can you define resource queues?
Resource queues are used to manage Greenplum Database workload. All users/queries can be prioritized using resource queues.
Can you explain the tools available in Greenplum to take backup and restores?
non-parallel backups:
Use the postgres utilities (pg_dump and pg_dumpall for backup, and pg_restore for restore).
Another useful command for getting data out of the database is COPY <table> TO <file>.
parallel backups:
gp_dump and gpcrondump for backups and gp_restore for restore process.
Can you explain the process of data migration from Oracle to Greenplum?
There are many ways. The simplest steps are: unload the data into CSV files, create tables in the Greenplum database corresponding to the Oracle tables, create an external table, start gpfdist pointing at the external table location, and load the data into Greenplum. You can also use the gpload utility, which creates the external table at runtime.
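The external-table step can be sketched by generating the DDL for a gpfdist-backed readable external table. The table, column, and host names below are hypothetical placeholders, not values from the original text:

```python
def external_table_ddl(table: str, columns: dict, gpfdist_url: str) -> str:
    # Builds the CREATE EXTERNAL TABLE statement used in the gpfdist
    # loading step; 'columns' maps column name -> Greenplum type.
    cols = ", ".join(f"{name} {typ}" for name, typ in columns.items())
    return (
        f"CREATE EXTERNAL TABLE ext_{table} ({cols}) "
        f"LOCATION ('{gpfdist_url}') "
        "FORMAT 'CSV' (HEADER);"
    )

ddl = external_table_ddl(
    "customers",                              # hypothetical table
    {"id": "int", "name": "text"},            # hypothetical columns
    "gpfdist://etl_host:8081/customers.csv",  # hypothetical gpfdist URL
)
```

Once gpfdist is serving the CSV file, the load itself is then something like INSERT INTO customers SELECT * FROM ext_customers;.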
Which command would you use to back up a database?
gp_dump, gpcrondump, pg_dump, pg_dumpall, copy
Can you explain delete/drop an existing database in greenplum?
gpdb=# \h DROP DATABASE
Command: DROP DATABASE
Description: remove a database
Syntax: DROP DATABASE [ IF EXISTS ] name
Also check the dropdb utility: $GPHOME/bin/dropdb --help
dropdb removes a PostgreSQL database.
Usage:
dropdb [OPTION]… DBNAME
Can you explain how to run gpcheckcat?
The gpcheckcat tool is used to check catalog inconsistencies between master and segments. It can be found in the
$GPHOME/bin/lib directory:
Usage: gpcheckcat <option> [dbname]
-B parallel : number of worker threads
-g dir : generate SQL to rectify catalog corruption, put it in dir
-h host : DB host name
-p port : DB port number
-P passwd : DB password
-o : check OID consistency
-U uname : DB user name
-v : verbose
Example: gpcheckcat gpdb >gpcheckcat_gpdb_logfile.log
Can you explain starting/stopping the database in admin and utility mode?
Admin mode: gpstart with the -R option stands for admin (restricted) mode, where only superusers can connect to the database when it is opened with this option.
Utility mode: utility mode allows you to connect to individual segments only, when started using gpstart -m. For example, to connect to the master instance only:
PGOPTIONS='-c gp_session_role=utility' psql
How can you see the value of a GUC?
Connect to the GPDB database using psql and query the catalog, or use SHOW.
Example: gpdb=# select name, setting from pg_settings where name = '<GUC_name>';
Or: gpdb=# show <GUC_name>;
How do I clone my production database to a PreProd or QA environment?
If Prod and QA are on the same GPDB cluster, use CREATE DATABASE <Clone_DBname> TEMPLATE <Source_DB>;
If Prod and QA are on different clusters, use the backup and restore utilities.
Why is ETL important for Greenplum?
As a data warehouse product, Greenplum can process huge data sets, often at the petabyte level, but it cannot generate that volume of data by itself. Data is usually generated by many users or embedded devices. Ideally, every data source would populate Greenplum directly, but in practice this is not feasible: data is the core asset of a company, and Greenplum is just one of many tools used to produce value from that asset. One common answer is to use an intermediate system to store all the data.
When Greenplum loads data from the intermediate system, the most important task becomes loading the data effectively enough. In some cases, users need newly generated data to be available within a limited delay. This is one of the most important tasks of ETL tools. In short, ETL tools help Greenplum load data from external sources reliably and effectively.
Can you explain the Greenplum interconnect?
The interconnect is the networking layer of the Greenplum Database. When a user connects to a database and issues a query, processes are created on each of the segments to handle the work of that query. The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication depends. The interconnect uses a standard 10 Gigabit Ethernet switch fabric.