Top+ Python Pandas Interview Questions and Answers 2023

Python Pandas Interview Questions and Answers for freshers & experienced candidates 2023.

What do you know about pandas?

Pandas is a powerful, flexible open-source library for data analysis and manipulation, used for working with structured data. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for data cleaning, grouping, filtering, and aggregation. Pandas is built on top of the NumPy library and integrates with other data analysis tools in the Python ecosystem, such as Matplotlib, SciPy, and Scikit-Learn.

Pandas provides a wide range of functions and methods for data manipulation, including merging, joining, pivoting, and reshaping data. It also has built-in support for handling missing data, time-series data, and categorical data.

What are the important features of the Pandas Library?

Here are some of the most important features of the Pandas library:

Data structures: It provides two primary data structures for storing and manipulating data: Series and DataFrame. A Series is a one-dimensional labelled array that can hold any data type, while a DataFrame is a two-dimensional labelled table with rows and columns.

Input/output: It can read data from various file formats, such as CSV, Excel, SQL databases, and JSON, and it can export data to these formats as well.

Visualization: It integrates with other Python libraries, such as Matplotlib and Seaborn, to provide powerful visualization tools for data analysis and exploration.

Data cleaning: It provides a wide range of functions for cleaning and transforming data, such as dropping missing values, filling in missing values, and replacing values.

Data manipulation: It has many powerful functions for manipulating data, such as merging, joining, and grouping data, as well as pivoting and reshaping data.

Indexing and selection: It provides a flexible and powerful system for indexing and selecting data, allowing you to select subsets of data based on various criteria.

Time-series data: It has built-in support for working with time-series data, including functions for resampling, rolling windows, and time-zone handling.
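As a small illustration of the data cleaning feature listed above, here is a minimal sketch using dropna(), fillna(), and replace(). The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps and a placeholder value.
df = pd.DataFrame({
    "city": ["Paris", "Rome", "N/A", "Oslo"],
    "temp": [21.0, np.nan, 18.5, np.nan],
})

# Drop rows where the temperature is missing.
dropped = df.dropna(subset=["temp"])

# Fill missing temperatures with the mean of the known values.
filled = df.fillna({"temp": df["temp"].mean()})

# Replace a placeholder string in one column.
replaced = df.replace({"city": {"N/A": "Unknown"}})

print(dropped)
print(filled)
print(replaced)
```

Each call returns a new DataFrame, so the original df is left unchanged.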

What are the important data structures used in Pandas?

Pandas provides two primary data structures for storing and manipulating data: Series and DataFrame.

Series: A Series is a one-dimensional labelled array that can hold any data type (integer, float, string, etc.). It is similar to a Python list or a NumPy array, but with added labels or indices for each element, allowing for more powerful indexing and selection. You can create a Series object using the pd.Series() function.

DataFrame: A DataFrame is a two-dimensional labelled data structure with rows and columns, similar to a spreadsheet or SQL table. It can hold a variety of data types (numeric, categorical, text, etc.) in each column, and can have row and column labels. You can create a DataFrame object using the pd.DataFrame() function.

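Older versions of Pandas also provided Panel (a three-dimensional labeled array) and Panel4D (a four-dimensional labeled array), but both have been removed from the library: Panel was deprecated in version 0.20 and removed in version 0.25. For higher-dimensional data, a DataFrame with a MultiIndex is the usual replacement.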
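The two primary structures described above can be sketched as follows. The names and values are illustrative only.

```python
import pandas as pd

# A Series: one-dimensional values with a label for each element.
ages = pd.Series([23, 33, 38], index=["Aliya", "Alex", "Jack"], name="age")
print(ages["Alex"])  # label-based lookup

# A DataFrame: columns of possibly different types sharing one row index.
people = pd.DataFrame(
    {"age": [23, 33, 38], "gender": ["F", "M", "M"]},
    index=["Aliya", "Alex", "Jack"],
)
print(people.loc["Jack", "gender"])

# Selecting a single column of a DataFrame gives back a Series.
print(type(people["age"]))
```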

Can you explain Index in Pandas?

An index is the set of labels that identifies each row or column in a DataFrame or a Series. Index labels are usually unique, although Pandas does not require them to be. It is a fundamental component of data structures in Pandas and is used to perform various data manipulation tasks. There are two main types of indexes in Pandas:

Numeric Index: It is a sequence of integers that starts from 0 and increments by 1 for each row or column. It is the default index type (a RangeIndex) for both Series and DataFrame objects in Pandas.

Label Index: It is a sequence of user-defined labels that can be used to uniquely identify each row or column in a DataFrame or a Series. Label indexes can be created using the Index class in Pandas or by specifying the index parameter when creating a DataFrame or a Series object.

There are several other types of indexes available in Pandas, including:

MultiIndex: Also known as a hierarchical index, it is a type of label index that allows indexing and selecting data based on multiple levels of labels.

DatetimeIndex: It is a type of label index that is used for time-series data.

CategoricalIndex: It is a type of index that is used for working with categorical data.
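The index types listed above can be created as in the following sketch. The labels and values are invented for illustration.

```python
import pandas as pd

# Default index: a RangeIndex starting at 0.
default_idx = pd.Series([10, 20, 30]).index

# Label index supplied through the index parameter.
label_idx = pd.Series([10, 20, 30], index=["a", "b", "c"]).index

# Hierarchical (multi-level) index.
multi_idx = pd.MultiIndex.from_tuples(
    [("2023", "Q1"), ("2023", "Q2"), ("2024", "Q1")],
    names=["year", "quarter"],
)

# Index of timestamps for time-series data.
date_idx = pd.date_range("2023-01-01", periods=3, freq="D")

# Index backed by categorical data.
cat_idx = pd.CategoricalIndex(["low", "high", "low"])

sales = pd.Series([100, 120, 90], index=multi_idx)
print(sales.loc["2023"])  # every row under the first-level label "2023"
```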

Can you explain Reindexing?

Reindexing is the process of conforming a Pandas object (Series, DataFrame) to a new set of index labels. It allows you to change the order of labels, add or remove labels, and fill the resulting missing values with a specific method. It is often used in data cleaning and manipulation tasks.

In Pandas, you can reindex a Series or a DataFrame using the reindex() method. The reindex() method creates a new object with the specified index labels, filling any missing values with NaN by default.
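A minimal sketch of reindex() on a Series, showing the default NaN fill, a constant fill_value, and forward filling. The sample labels are assumptions made for the example.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Reorder labels, drop "b", and introduce a new label "d" (filled with NaN).
reordered = s.reindex(["c", "a", "d"])

# Same labels, but fill the new label with a constant instead of NaN.
zero_filled = s.reindex(["c", "a", "d"], fill_value=0)

# Forward-fill gaps; method-based filling needs a sorted (monotonic) index.
sparse = pd.Series([1.0, 2.0], index=[0, 2])
dense = sparse.reindex(range(4), method="ffill")

print(reordered)
print(zero_filled)
print(dense)
```

The original Series s keeps its labels, because reindex() returns a new object.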

What do you know about data aggregation in pandas?

Data aggregation in Pandas refers to the process of grouping data and applying functions to the grouped data, such as sum, mean, count, and other statistical calculations. It is a common data manipulation task in data analysis, and Pandas provides several functions for performing data aggregation.

In Pandas, you can perform data aggregation using the groupby() method, which groups the data by one or more columns and applies a function to each group.
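For instance, a grouped aggregation might look like the sketch below. The team and points columns are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "B"],
    "points": [10, 20, 5, 15, 10],
})

# One aggregate per group.
totals = df.groupby("team")["points"].sum()

# Several named aggregates per group in one call.
summary = df.groupby("team").agg(
    total=("points", "sum"),
    average=("points", "mean"),
    games=("points", "count"),
)

print(totals)
print(summary)
```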

What is categorical data in Pandas?

What are the different ways in which a DataFrame can be created in Pandas?

There are several ways to create a DataFrame in Pandas:

List: A DataFrame can be created from a list of dictionaries, where each dictionary represents a row of data.

For example:

import pandas as pd

data = [{'name': 'Aliya', 'age': 23, 'gender': 'F'},
        {'name': 'Alex', 'age': 33, 'gender': 'M'},
        {'name': 'Jack', 'age': 38, 'gender': 'M'}]

df = pd.DataFrame(data)

Dictionary: A DataFrame can be created from a dictionary of lists, where each key-value pair represents a column of data. For example:

data = {'name': ['Aliya', 'Alex', 'Jack'],
        'age': [23, 33, 38],
        'gender': ['F', 'M', 'M']}

df = pd.DataFrame(data)

NumPy array:

A DataFrame can be created from a NumPy array, where each row of the array represents a row of data and each column of the array represents a column of data. For example:

import pandas as pd

import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])

SQL database: A DataFrame can be created from a SQL database using the read_sql() function in Pandas. For example (my_table is a placeholder table name):

import sqlite3

conn = sqlite3.connect('data.db')

df = pd.read_sql('SELECT * FROM my_table', conn)

CSV file: A DataFrame can be created from a CSV file using the read_csv() function in Pandas. For example:

import pandas as pd

df = pd.read_csv('data.csv')

How would you add and delete columns and rows in Pandas?

To add and delete columns and rows in Pandas, you can use the following methods:

Adding columns: You can add a new column to a DataFrame by assigning a list or Series of values to a new column name, or by using the insert() method. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df['C'] = [7, 8, 9] # add a new column by assigning a list of values

df.insert(1, 'D', [10, 11, 12]) # insert a new column at position 1

Deleting columns: You can delete a column from a DataFrame using the drop() method with the axis=1 parameter. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df = df.drop('C', axis=1) # delete the 'C' column

Adding rows: You can add a new row to a DataFrame by using the loc[] indexer, or by concatenating another DataFrame with pd.concat(). The older DataFrame.append() method was removed in Pandas 2.0. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df.loc[3] = [7, 8] # add a new row using the loc[] indexer

df = pd.concat([df, pd.DataFrame([{'A': 10, 'B': 11}])], ignore_index=True) # add a new row by concatenating a one-row DataFrame

Deleting rows: You can delete a row from a DataFrame using the drop() method with the index of the row and axis=0 parameter. For example:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

df = df.drop(1, axis=0) # delete the row with index 1

How do you set and reset indexes in Pandas?

In Pandas, you can set one or more columns as the index of a DataFrame using the set_index() method. This method returns a new DataFrame with the specified column(s) as the index.

Here is an example of setting a single column as the index:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df = df.set_index('A') # set column 'A' as the index

Here is an example of setting multiple columns as the index:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df = df.set_index(['A', 'B']) # set columns 'A' and 'B' as the index

You can also use the inplace=True parameter to modify the DataFrame in place, without creating a new one:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

df.set_index('A', inplace=True) # set column 'A' as the index in place

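If you want to reset the index of a DataFrame to its default (0, 1, 2, …), you can use the reset_index() method. The old index is moved back into a regular column unless you pass drop=True:

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}).set_index('A')

df.reset_index(inplace=True) # 'A' becomes a column again and the index is reset to the default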

What do you understand about Data Operations in Pandas?

Data operations in Pandas involve manipulating, transforming, and analyzing data in a variety of ways. Some common data operations in Pandas include:

Filtering: Selecting a subset of data based on specific criteria, such as values in one or more columns. This can be done using boolean indexing or the query() method.

Sorting: Arranging data in ascending or descending order based on one or more columns. This can be done using the sort_values() method.

Grouping: Aggregating data by one or more columns and applying a function to each group. This can be done using the groupby() method.

Aggregating: Calculating summary statistics for groups of data, such as mean, median, standard deviation, and count. This can be done using methods such as mean(), median(), std(), and count().

Merging and joining: Combining multiple DataFrames based on one or more common columns. This can be done using the merge() and join() methods.

Reshaping: Transforming data from one format to another, such as pivoting data from long to wide format or stacking data from wide to long format. This can be done using methods such as pivot(), stack(), and unstack().

Applying functions: Applying custom functions to one or more columns or rows of data. This can be done using the apply() method.
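Several of the operations listed above are combined in the following sketch. The orders and customers tables, and their column names, are assumptions made for the example.

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [50.0, 20.0, 30.0, 70.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Aliya", "Alex", "Jack"],
})

# Filtering: boolean indexing and the equivalent query() call.
large = orders[orders["amount"] > 25]
same = orders.query("amount > 25")

# Sorting: largest amounts first.
ranked = orders.sort_values("amount", ascending=False)

# Merging: attach customer names using the shared customer_id column.
merged = orders.merge(customers, on="customer_id")

# Grouping and aggregating: total amount per customer.
per_customer = merged.groupby("name")["amount"].sum()

# Applying a function: double every amount.
doubled = orders["amount"].apply(lambda x: x * 2)

print(per_customer)
```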

How to create an empty DataFrame in Pandas?

To create an empty DataFrame in Pandas, you can call the DataFrame() constructor with no arguments, or pass a list of column names through the columns parameter. Here are the examples:

Creating an empty DataFrame with no columns:

import pandas as pd

df = pd.DataFrame()

Creating an empty DataFrame with columns:

import pandas as pd

columns = ['col1', 'col2', 'col3']

df = pd.DataFrame(columns=columns)

In both cases, the resulting DataFrame will have no rows, but the second example will have three columns named 'col1', 'col2', and 'col3'. You can add rows to the DataFrame using pd.concat() or the loc[] indexer. The older append() method was removed in Pandas 2.0.
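As a rough sketch, rows can be added to an empty DataFrame with loc[] like this. The row values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame(columns=["col1", "col2", "col3"])
print(df.empty)  # no rows yet

# Assigning to a label that does not exist yet adds a new row.
df.loc[len(df)] = [1, "a", 2.5]
df.loc[len(df)] = [2, "b", 3.5]

print(df.shape)
```

Adding rows one at a time is slow for large datasets. Collecting the rows in a list and building the DataFrame once is usually a better choice.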

Can you explain Series in Pandas?

How to create a copy of a Series in Pandas?

To create a copy of a Series in Pandas, you can use the copy() method. The copy() method creates a deep copy of the Series, which means that any changes made to the copy will not affect the original Series.

Here is an example of creating a copy of a Series:

data = [1, 2, 3, 4, 5]

s = pd.Series(data)

s_copy = s.copy()

In this example, s_copy is a deep copy of s, and any changes made to s_copy will not affect s.
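The independence of the copy can be checked with a short sketch like this one:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
s_copy = s.copy()  # deep copy by default

# Modify the copy only.
s_copy.iloc[0] = 100

print(s.iloc[0])       # value in the original
print(s_copy.iloc[0])  # value in the copy
```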

How will you convert a NumPy array into a DataFrame in Pandas?

To convert a NumPy array into a DataFrame in Pandas, you can use the pd.DataFrame() constructor and pass the NumPy array as the argument. Here’s an example:

import numpy as np

import pandas as pd

# create a 2D NumPy array

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# convert the NumPy array to a DataFrame

df = pd.DataFrame(data)

# print the DataFrame

print(df)

This will output:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

By default, the columns of the DataFrame will be numbered starting from 0. If you want to specify custom column names, you can pass them as a list to the columns parameter of the pd.DataFrame() constructor, like this:

import numpy as np

import pandas as pd

# create a 2D NumPy array

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# specify column names

columns = ['A', 'B', 'C']

# convert the NumPy array to a DataFrame with custom column names

df = pd.DataFrame(data, columns=columns)

# print the DataFrame

print(df)

This will output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
