What is a Data Analyst?
A data analyst is someone who collects, processes, and performs statistical analyses of data. He or she translates numbers and data into plain English to help organizations and companies understand how to make better business decisions. Whether it is market research, sales figures, logistics, or transportation costs, every business collects data. A data analyst takes that data and works out a variety of things, such as how to price new materials, how to reduce transportation costs, or how to deal with issues that cost the company money.
What does a Data Analyst do?
Data analysts look for patterns and clues in raw data and translate those numbers into something understandable to improve how a business or project is run. Data can come from hundreds of different sources: it may sit in raw form in a computer database, come from customer surveys, or be pulled from other large companies for comparison. If you are preparing a report, you need to collect all of this data and make it meaningful and understandable to people who are not necessarily quantitative, so as you collect data you need to know where each piece will fit in.
A data analyst typically handles data coming from or going into a data warehouse or business intelligence system. They compile reports, verify data quality (integrity), and use the data to help executive- and senior-level staff make informed company decisions. Depending on the needs of the organization, the work can also include information visualization, statistics, and database application design.
Data analysts typically use computer systems and calculation applications to figure out their numbers. Data must be regulated, normalized, and calibrated so that it can be extracted, used alone, or combined with other numbers and still keep its integrity. Facts and numbers are the starting point, but what matters most is understanding what they mean and presenting the findings in an engaging way, using graphs, charts, tables, and graphics.
Data analysts need to have the ability to not only decipher data, but to report and explain what differences in numbers mean when looked at from year to year or across various departments. Because data analysts are often the ones with the best sense of why the numbers are the way they are, they are often asked to advise project managers and department heads concerning certain data points and how they can be changed or improved over a period of time.
What are the responsibilities of a Data Analyst?
- Identify areas where the company can improve efficiency and automate processes.
- Support other data analysts when problems arise and work as part of a team.
- Coordinate properly with customers and internal staff.
- Resolve issues related to audits of data analysis.
- Design, conduct, and analyze surveys.
- Interpret data and analyze results using statistical techniques.
- Provide ongoing daily reports; from time to time, prepare reports for both internal and external audiences using various business analytics reporting tools.
- Work with both internal and external clients to adequately understand data content.
- Interpret, analyze, and identify patterns in complex datasets.
- Maintain the databases and the data systems they run on.
- Acquire data from primary and secondary data sources.
- Filter and clean data so that reports are accurate and consistent.
- Resolve code-related problems by keeping a regular check on data reports.
- Secure the database and maintain a user-access database for security purposes.
What are the skills needed to become a data analyst?
Skills Required by a Data Analyst:
Analytical Skills: Analytical skills are of huge importance in data analysis. These skills refer to the ability to gather, view, and analyze all forms of information in detail, and to look at a challenge or situation from different perspectives.
Programming skills: Knowledge of programming languages such as R and Python is extremely important for any data analyst. Commonly used software and tools include scripting languages (MATLAB, Python), query languages (SQL, Hive, Pig), spreadsheets (Excel), and statistical packages (SAS, R, SPSS). Other useful computer skills include programming (JavaScript, XML) and big data tools (Spark, Hive HQL).
Statistical and mathematical skills: A working knowledge of descriptive and inferential statistics and experimental design is also a must for data analysts.
Business Skills: You also need to possess certain business skills to function well as a data analyst.
Data Munging or Data Wrangling skills: The ability to map raw data and convert it into another format that allows more convenient consumption of the data (a short sketch follows this list of skills).
Understanding Databases: Essentially used to better understand the customer, database analysis extends from basic analysis to complex data mining through various tools, such as Geographic Information Systems (GIS) or text analysis. The basic steps for analyzing databases are to extract, clean, merge, analyze, and implement.
Machine learning skills, Communication and Data Visualization skills.
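To make the data-wrangling skill above concrete, here is a minimal pandas sketch; the file name and column names (sales_raw.csv, order_date, revenue) are hypothetical placeholders rather than a prescribed workflow:

```python
import pandas as pd

# Hypothetical raw export; the file and column names are assumptions for illustration.
raw = pd.read_csv("sales_raw.csv")

# Normalize column names, parse dates, and coerce numeric fields.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")
raw["revenue"] = pd.to_numeric(raw["revenue"], errors="coerce")

# Reshape from a wide export into a tidy, analysis-friendly format.
tidy = raw.melt(id_vars=["order_date"], var_name="metric", value_name="value")
```

The point is simply to show raw data being normalized and reshaped into a format that downstream analysis can consume.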
Some of the areas where you can work as a data analyst include:
- Data Assurance
- Finance
- Higher Education
- Sales
- Marketing
- Business Intelligence
- Data Quality
What are the various steps in an analytics project?
Various steps in an analytics project include:
- Problem definition
- Data exploration
- Data preparation
- Data Modelling
- Validation of data
- Implementation and tracking
What is a data engineer?
Data engineers build and optimize the systems that allow data scientists and analysts to perform their work. Every company depends on its data to be accurate and accessible to individuals who need to work with it. The data engineer ensures that any data is properly received, transformed, stored, and made accessible to other users.
What is a data scientist?
A data scientist is a specialist that applies their expertise in statistics and building machine learning models to make predictions and answer key business questions. A data scientist still needs to be able to clean, analyze, and visualize data, just like a data analyst. However, a data scientist will have more depth and expertise in these skills and will also be able to train and optimize machine learning models.
What is data cleansing?
Data cleaning, also referred to as data cleansing, is the process of identifying and removing errors and inconsistencies from data in order to improve data quality.
In how many ways can we perform Data Cleansing?
The data cleansing process can be approached in the following ways:
- Sort the data by various attributes so that similar problems are grouped together.
- For large datasets, clean the data stepwise, improving it with each iteration until an acceptable quality is reached.
- For big projects, break the dataset into chunks and work through them sequentially; this produces clean data faster than working on the whole set at once.
- Build a set of utility scripts or tools for common cleansing tasks to maximize the speed of the process and shorten its duration.
- Arrange issues by estimated frequency and clear the most common problems first.
- Analyze summary statistics for each column (counts, ranges, missing values) to spot problems quickly.
- Keep a check on daily data cleansing and refine the set of utility tools as requirements change.
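As a rough illustration of the chunk-by-chunk approach and reusable utilities described above, a pandas sketch might look like the following; the file and column names (transactions.csv, email, amount) are assumptions for illustration only:

```python
import pandas as pd

def clean_chunk(df: pd.DataFrame) -> pd.DataFrame:
    """Reusable cleansing utility; column names here are illustrative assumptions."""
    df = df.drop_duplicates()
    df["email"] = df["email"].str.strip().str.lower()            # normalize a text field
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # flag bad numbers as NaN
    return df.dropna(subset=["amount"])                          # drop rows that cannot be repaired

# Work through a large file chunk by chunk, as suggested above.
cleaned = pd.concat(clean_chunk(chunk)
                    for chunk in pd.read_csv("transactions.csv", chunksize=100_000))
```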
Can you define Data Profiling?
Data Profiling, also referred to as Data Archaeology, is the process of assessing the data values in a given dataset for uniqueness, consistency, and logic. Data profiling cannot identify incorrect or inaccurate data; it can only detect business-rule violations or anomalies. The main purpose of data profiling is to find out whether the existing data can be used for various other purposes.
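A lightweight way to profile a dataset for uniqueness and consistency is to summarize each column. The sketch below assumes a hypothetical customers.csv and is only meant to show the idea:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),      # declared type of each column
    "non_null": df.notna().sum(),        # how many values are present
    "unique": df.nunique(),              # uniqueness of each column
    "pct_missing": df.isna().mean().round(3),
})
print(profile)
```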
Can you define Data Mining?
Data Mining refers to the analysis of datasets to find relationships that were not discovered earlier. It focuses on sequential discoveries, identifying dependencies, bulk analysis, finding various types of attributes, and so on.
Can you define logistic regression?
Logistic regression is a statistical method for examining a dataset in which one or more independent variables determine an outcome, and that outcome is typically binary (for example, churn vs. no churn).
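A minimal scikit-learn sketch of logistic regression on a binary outcome might look like this; the built-in breast-cancer dataset is used purely as a stand-in for real business data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)            # dataset with a binary outcome
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)              # fit the model on the training split
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))   # evaluate on held-out data
```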
How will you handle the QA process when developing a predictive model to forecast customer churn?
Data analysts require input from the business owners and a collaborative environment to operationalize analytics. To create and deploy predictive models in production, there should be an effective, efficient, and repeatable process. Without feedback from the business owner, the model will just be a one-and-done model.
Can you define K-mean Algorithm?
K-means is a well-known partitioning method. Objects are classified as belonging to one of K groups, with K chosen a priori.
In the K-means algorithm:
- The clusters are spherical: the data points in a cluster are centered on that cluster.
- The variance/spread of the clusters is similar: each data point is assigned to the closest cluster.
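A short scikit-learn sketch of K-means on synthetic, roughly spherical blobs (the data here is generated only for illustration) might look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three roughly spherical blobs, matching the assumptions listed above.
points = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.cluster_centers_)   # one center per cluster
print(kmeans.labels_[:10])       # cluster assignment of the first few points
```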
What are the important steps in data validation process?
Data Validation is performed in 2 different steps-
Data Screening: In this step, various algorithms are used to screen the entire data set for erroneous or questionable values. Such values then need to be examined and handled.
Data Verification: In this step, each suspect value is evaluated on a case-by-case basis and a decision is made whether the value should be accepted as valid, rejected as invalid, or replaced with a corrected value.
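A minimal sketch of the data-screening step, assuming a hypothetical measurements.csv with an age column and simple range rules, could look like this:

```python
import pandas as pd

df = pd.read_csv("measurements.csv")   # hypothetical dataset

# Data screening: flag questionable values using simple rule-based checks.
suspect = df[(df["age"] < 0) | (df["age"] > 120) | df["age"].isna()]

# Data verification: each suspect row is then reviewed case by case
# and accepted, rejected, or replaced.
print(suspect)
```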
What are the general problems in the work of Data Analyst?
The common problems that occur in the work of a data analyst are as follows:
- Overlapping data that has to be reconciled
- Improper or illegal values
- The same value represented in different ways
- Missing values
- Duplicated (over-copied) entries
- Misspellings
What are the missing patterns that are generally observed while working on a data sheet?
The missing-data patterns that are generally observed are as follows:
- Missingness that depends on an unobserved input variable
- Missingness that depends on the missing value itself
- Missing at random
- Missing completely at random
You are assigned a new data analytics project. How will you begin with and what are the steps you will follow?
How should you answer this question? The interviewer wants to understand how you approach a given data problem and the thought process you follow to stay organized. You can start by saying that you would first find and define the objective of the problem, so that there is a solid direction on what needs to be done.
The next step would be data exploration: familiarizing yourself with the entire dataset, which is very important when working with new data. After that comes preparing the data for modelling, which includes finding outliers, handling missing values, and validating the data. Having validated the data, you would start data modelling until you discover meaningful insights. The final step would be to implement the model and track the output results.
What is the criteria for a good data model?
Criteria for a good data model include:
- It can be easily consumed
- Large changes in data should be scalable
- It should provide predictable performance
- It should adapt to changes in requirements
Can you explain hash table?
A hash table is a data structure that stores a set of items. Each item has a key that identifies it; items are found, added, and removed from the hash table by using the key. Hash tables may seem like arrays, but there are important differences: an array index must be a sequential integer, whereas a hash table key can be almost any value, and the slot an item occupies is computed from its key.
Hashing is implemented in two steps:
- An element's key is converted into an integer by using a hash function. That integer is used as an index into the hash table and determines the slot into which the original element falls.
- The element is stored in the hash table, where it can be quickly retrieved using the hashed key.
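The two steps can be sketched in a few lines of Python; the built-in hash() plays the role of the hash function and the table size here is an arbitrary choice:

```python
def bucket_index(key, table_size=16):
    """Step 1: convert the key to an integer; step 2: map it to a slot."""
    return hash(key) % table_size

table = [None] * 16
key, value = "user_42", {"name": "Ada"}
table[bucket_index(key)] = (key, value)   # store the element in its slot
print(table[bucket_index(key)])           # quick retrieval via the hashed key
```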
Can you define Re-hashing?
Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty “slot” in the table is found. The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same order, a sought key can always be located.
What are hash table collisions? How is it avoided?
A hash table collision happens when two different keys hash to the same value. Two items cannot be stored in the same slot of the array.
There are many techniques to avoid hash table collisions; two common ones are:
Separate chaining: uses a secondary data structure (such as a linked list) in each slot to store the multiple items that hash to that slot.
Open addressing: searches for other slots using a second function and stores the item in the first empty slot that is found.
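A small sketch of separate chaining (not a production implementation) might look like this:

```python
class ChainedHashTable:
    """Each slot holds a list of (key, value) pairs that hash to the same index."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # update an existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # colliding keys simply share the bucket

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"), t.get("b"))
```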
Can you define KPI?
A Key Performance Indicator is a measurable value that demonstrates how effectively a company is achieving key business objectives. Organizations use KPIs at multiple levels to evaluate their success at reaching targets. High-level KPIs may focus on the overall performance of the enterprise, while low-level KPIs may focus on processes in departments such as sales, marketing or a call center.
Can you define time series analysis?
Time series analysis can be done in two domains: the frequency domain and the time domain. In time series analysis, the output of a particular process can be forecast by analyzing previous data with methods such as exponential smoothing and log-linear regression.
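As a simple illustration of exponential smoothing, the sketch below smooths a small, made-up series; the alpha value is an arbitrary choice:

```python
def exponential_smoothing(series, alpha=0.3):
    """Each smoothed value blends the new observation with the previous smoothed value."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [120, 132, 101, 134, 90, 150, 145]   # hypothetical monthly figures
print(exponential_smoothing(sales))
```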
Can you define correlogram analysis?
A correlogram analysis is a common form of spatial analysis in geography. It consists of a series of estimated autocorrelation coefficients calculated for different spatial relationships. It can also be used to construct a correlogram for distance-based data, when the raw data is expressed as distances rather than values at individual points.
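A correlogram is essentially a series of autocorrelation coefficients plotted against lag. The minimal numpy sketch below uses a synthetic one-dimensional series; spatial correlograms follow the same idea over distance classes:

```python
import numpy as np

def autocorrelation(x, lag):
    """Autocorrelation coefficient of a series with itself shifted by `lag`."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

series = np.sin(np.linspace(0, 6 * np.pi, 60)) + np.random.default_rng(0).normal(0, 0.1, 60)
correlogram = [autocorrelation(series, lag) for lag in range(1, 11)]
print(correlogram)   # plotted against lag, these coefficients form the correlogram
```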
Can you define clustering?
In a computer system, a cluster is a group of servers and other resources that act like a single system and enable high availability and, in some cases, load balancing and parallel processing.
In analytics, clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. In data mining, this methodology partitions the data using a specific algorithm that is most suitable for the desired analysis.
Clustering is thus a classification method applied to data: a clustering algorithm divides a data set into natural groups or clusters.
Properties of a clustering algorithm include:
- Hierarchical or flat
- Iterative
- Hard or soft
- Disjunctive
Can you define n-gram?
N-gram: An n-gram is a contiguous sequence of n items from a given sequence of text or speech. It is the basis of a type of probabilistic language model for predicting the next item in such a sequence, using the preceding (n-1) items as context.
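A minimal sketch of extracting n-grams from a token sequence:

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n items from the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the quick brown fox".split(), 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```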
Can you define collaborative filtering?
Collaborative filtering is a simple algorithm for building a recommendation system based on user behavioral data. The most important components of collaborative filtering are users, items, and interests.
A good example of collaborative filtering is a statement like “recommended for you” on online shopping sites, which pops up based on your browsing history.
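A toy user-based collaborative-filtering sketch, using a tiny hypothetical rating matrix and cosine similarity (real systems are far more elaborate), might look like this:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

target = 0                                                              # recommend for the first user
similarities = np.array([cosine(ratings[target], ratings[u]) for u in range(len(ratings))])
scores = similarities @ ratings                                         # weight other users' ratings by similarity
scores[ratings[target] > 0] = -np.inf                                   # ignore items already rated
print("recommend item", int(np.argmax(scores)))
```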
How do you deal with multi-source problems?
To deal with multi-source problems:
- Restructure the schemas to accomplish schema integration.
- Identify similar records and merge them into a single record containing all relevant attributes without redundancy.
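A minimal pandas sketch of merging similar records from two hypothetical source systems on a shared key (the column names are assumptions):

```python
import pandas as pd

# Hypothetical customer records coming from two different source systems.
crm  = pd.DataFrame({"email": ["a@x.com", "b@x.com"], "name": ["Ann", "Bob"]})
shop = pd.DataFrame({"email": ["a@x.com", "c@x.com"], "city": ["Oslo", "Lima"]})

# Schema integration: align on a common key, then merge similar records
# into a single record holding all relevant attributes without redundancy.
merged = crm.merge(shop, on="email", how="outer").drop_duplicates(subset="email")
print(merged)
```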
Can you define Outlier?
An outlier is a term commonly used by analysts for a value that appears far away from, and diverges from, the overall pattern in a sample. There are two types of outliers:
- Univariate
- Multivariate
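A simple univariate outlier check using z-scores; the sample values and the threshold of 2 are illustrative only:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10])   # hypothetical sample

# Univariate outlier detection with z-scores: points far from the mean stand out.
z = (values - values.mean()) / values.std()
print(values[np.abs(z) > 2])    # -> [95]
```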
Can you explain Design of experiments?
Design of experiments (DOE) is a systematic, rigorous approach to engineering problem-solving that applies principles and techniques at the data collection stage so as to ensure the generation of valid, defensible, and supportable engineering conclusions. In addition, all of this is carried out under the constraint of a minimal expenditure of engineering runs, time, and money.
Can you define EDD?
Experimental Design Diagram (EDD) is a diagram used in science classrooms to design an experiment. This diagram helps to identify the essential components of an experiment.
Can you explain 80/20 rule?
It means that 80 percent of your income comes from 20 percent of your clients. In other words, by the numbers, 80 percent of your outcomes come from 20 percent of your inputs. As Pareto demonstrated with his research, this “rule” holds true, in a very rough sense, at an 80/20 ratio; however, in many cases the ratio can be far more extreme, and 99/1 may be closer to reality.
Can you define Map Reduce?
Map-reduce is a framework for processing large data sets: the data is split into subsets, each subset is processed on a different server, and the results obtained on each are then blended together.
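The idea can be sketched with a toy word count in Python, where a process pool stands in for the cluster of servers; this is only an analogy to the real framework:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_phase(chunk):
    """Count words in one subset of the data."""
    return Counter(chunk.split())

def reduce_phase(a, b):
    """Blend partial results from different workers."""
    return a + b

chunks = ["big data big results", "data data everywhere"]   # stand-ins for data splits
if __name__ == "__main__":
    with Pool(2) as pool:
        partials = pool.map(map_phase, chunks)               # map step on separate workers
    print(reduce(reduce_phase, partials))                    # reduce step blends the partial counts
```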
What is KNN imputation method?
In KNN imputation, a missing value is filled in using the values of the k records (nearest neighbours) that are most similar to the record with the missing value.
Why using KNN?
KNN is an algorithm that matches a point with its closest k neighbours in a multi-dimensional space. It can be used for data that is continuous, discrete, ordinal, or categorical, which makes it particularly useful for dealing with all kinds of missing data.
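A minimal scikit-learn sketch of KNN imputation; the tiny matrix is made up for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])

# Each missing value is filled using the average of its k nearest rows.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```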
Can you define MNAR?
Missing Not at Random (MNAR): When the missing values are neither MCAR nor MAR. In the previous example that would be the case if people tended not to answer the survey depending on their depression level.
What is regression imputation?
Regression imputation has the opposite problem of mean imputation. A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
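A hedged sketch of regression imputation with scikit-learn, using a made-up height/weight example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"height": [150, 160, 170, 180, 190],
                   "weight": [55, 60, np.nan, 80, 90]})     # hypothetical data

# Fit the regression on fully observed rows, then predict the missing values.
observed = df.dropna()
model = LinearRegression().fit(observed[["height"]], observed["weight"])

missing = df["weight"].isna()
df.loc[missing, "weight"] = model.predict(df.loc[missing, ["height"]])
print(df)
```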
What are some best tools that can be useful for data-analysis?
- Tableau
- RapidMiner
- OpenRefine
- KNIME
- Solver
- NodeXL
- Io
- Wolfram Alpha
- Google Fusion Tables
- Google Search Operators and more