Python Pandas Interview Questions and Answers for freshers & experienced candidates 2023.
Pandas is a powerful, flexible open-source Python library for the analysis and manipulation of structured data. It provides data structures for efficiently storing and manipulating large datasets, as well as tools for data cleaning, grouping, filtering, and aggregation. Pandas is built on top of the NumPy library and integrates with other data analysis tools in the Python ecosystem, such as Matplotlib, SciPy, and Scikit-Learn. It offers a wide range of functions and methods for merging, joining, pivoting, and reshaping data, and it has built-in support for missing data, time-series data, and categorical data.
Data structures: It provides two primary data structures, Series and DataFrame. A Series is a one-dimensional labelled array that can hold any data type, while a DataFrame is a two-dimensional labelled table with rows and columns.
Input/output: It can read data from sources such as CSV files, Excel files, SQL databases, and JSON, and it can write data back to these formats as well.
Visualization: It integrates with other Python libraries, such as Matplotlib and Seaborn, to plot data for analysis and exploration.
Data cleaning: It provides a wide range of functions for cleaning and transforming data, such as dropping missing values, filling in missing values, and replacing values.
Data manipulation: It has functions for merging, joining, and grouping data, as well as pivoting and reshaping data.
Indexing and selection: It provides a flexible system for indexing and selecting data, allowing you to select subsets of data by label, by position, or by condition.
Time-series data: It has built-in support for working with time-series data, including functions for resampling, rolling windows, and time-zone handling.
Pandas provides two primary data structures for storing and manipulating data: Series and DataFrame.
Series: A Series is a one-dimensional labelled array that can hold any data type (integer, float, string, etc.). It is similar to a Python list or a NumPy array, but each element also has a label in the index, which allows for more powerful indexing and selection. You can create a Series object using the pd.Series() constructor.
DataFrame: A DataFrame is a two-dimensional labelled data structure with rows and columns, similar to a spreadsheet or SQL table. Each column can hold a different data type (numeric, categorical, text, etc.), and both rows and columns have labels. You can create a DataFrame object using the pd.DataFrame() constructor.
Older versions of Pandas also provided Panel (a three-dimensional labelled array) and Panel4D (a four-dimensional labelled array), but both have been removed from current releases. Data with more than two dimensions is now usually represented with a MultiIndex on a DataFrame.
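As a minimal sketch of the two structures described above, the following builds a small Series and DataFrame. The fruit names, prices, and variable names are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"])

# A DataFrame: a table whose columns can each have a different dtype.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "in_stock": [True, False, True]},
    index=["apple", "banana", "cherry"],
)

banana_price = prices["banana"]   # label-based lookup on a Series
price_column = fruit["price"]     # selecting one column returns a Series
```

Selecting a single column from a DataFrame returns a Series that shares the DataFrame's row labels, which is why the two structures work so naturally together.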
An index is the set of labels that identifies each row or column in a DataFrame or each element in a Series. Labels are usually unique, but Pandas does not require them to be. The index is a fundamental component of Pandas data structures and is used for selection, alignment, and many other data manipulation tasks. There are two main kinds of index in Pandas:
Numeric Index: A sequence of integers that starts from 0 and increments by 1 for each row. It is the default index (a RangeIndex) for both Series and DataFrame objects.
Label Index: A sequence of user-defined labels that identify each row or column. Label indexes can be created using the Index class or by specifying the index parameter when creating a DataFrame or a Series.
There are several other types of index available in Pandas, including:
MultiIndex: Also known as a hierarchical index, it allows indexing and selecting data based on multiple levels of labels.
DatetimeIndex: An index of timestamps, used for time-series data.
CategoricalIndex: An index used for working with categorical data.
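The index types listed above can be illustrated with a short sketch. All values, labels, and level names here are invented for the example.

```python
import pandas as pd

# Default integer index: a RangeIndex starting at 0.
default_s = pd.Series([10, 20, 30])

# User-defined label index.
labelled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Hierarchical (multi-level) index built from tuples.
multi = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1)], names=["group", "item"]
)
hier = pd.Series([1.0, 2.0, 3.0], index=multi)
x_values = hier.loc["x"]   # all entries whose first level is "x"

# DatetimeIndex for time-series data.
dates = pd.date_range("2023-01-01", periods=3, freq="D")
ts = pd.Series([5, 6, 7], index=dates)
```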
Reindexing is the process of conforming a Pandas object (Series or DataFrame) to a new set of index labels. It allows you to change the order of the labels, add or remove labels, and fill the values for newly introduced labels using a specific method. It is often used in data cleaning and manipulation tasks. In Pandas, you can reindex a Series or a DataFrame using the reindex() method. The reindex() method creates a new object with the specified index labels, and any label that was not present in the original gets NaN by default.
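A small sketch of reindex() on a Series follows. The labels and values are illustrative only; fill_value and method are standard reindex() parameters.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Reorder, drop "b", and introduce a new label "d" (filled with NaN).
reordered = s.reindex(["c", "a", "d"])

# Same labels, but new labels get 0 instead of NaN.
filled = s.reindex(["c", "a", "d"], fill_value=0)

# Forward-fill gaps when reindexing a sorted index.
s2 = pd.Series([10, 20], index=[0, 2])
ffilled = s2.reindex([0, 1, 2, 3], method="ffill")
```

Note that reindex() returns a new object in each case; the original Series s is left unchanged.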
Data aggregation in Pandas refers to the process of grouping data and applying functions to the grouped data, such as sum, mean, count, and other statistical calculations. It is a common data manipulation task in data analysis, and Pandas provides several functions for performing data aggregation. In Pandas, you can perform data aggregation using the groupby() method, which groups the data by one or more columns and applies a function to each group.
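As a hedged illustration of groupby(), the sketch below aggregates a made-up sales table by region. The column names and figures are assumptions for the example.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100, 150, 200, 50],
})

# One aggregate per group.
totals = sales.groupby("region")["amount"].sum()

# Several aggregates at once with agg().
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])
```

Here totals is a Series indexed by region, while summary is a DataFrame with one column per aggregate.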
Categorical data in Pandas refers to data whose values come from a limited, predefined set of possible values. Examples of categorical data include the gender of a person, the type of a car, or the colour of a fruit. It is represented using the Categorical data type (dtype "category"), which stores each distinct category once and records the values as integer codes. This typically saves memory and speeds up operations on columns with many repeated values, and it also lets you define an order among the categories.
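The following sketch shows one way to create categorical data, first unordered and then with an explicit order. The colour and size values are illustrative.

```python
import pandas as pd

# Unordered categorical: categories are inferred from the data.
colours = pd.Series(["red", "green", "red", "blue"], dtype="category")
categories = colours.cat.categories.tolist()   # distinct values, sorted
codes = colours.cat.codes.tolist()             # integer code per element

# Ordered categorical: comparisons follow the declared category order.
sizes = pd.Series(
    pd.Categorical(
        ["small", "large", "medium"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)
bigger_than_small = sizes[sizes > "small"]
```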
There are several ways to create a DataFrame in Pandas:
List: A DataFrame can be created from a list of dictionaries, where each dictionary represents a row of data.
Dictionary: A DataFrame can be created from a dictionary of lists, where each key-value pair represents a column of data.
NumPy array: A DataFrame can be created from a two-dimensional NumPy array, where each row of the array represents a row of data and each column of the array represents a column of data.
SQL database: A DataFrame can be created from the result of a SQL query using pd.read_sql().
CSV file: A DataFrame can be created from a CSV file using pd.read_csv().
To add and delete columns and rows in Pandas, you can use the following methods:
Adding columns: You can add a new column to a DataFrame by assigning a list, array, or Series to a new column name, or by using the insert() method to place it at a specific position.
Deleting columns: You can delete a column from a DataFrame using the drop() method with the axis=1 parameter (or the columns parameter).
Adding rows: You can add a new row to a DataFrame with the loc[] indexer, or by concatenating another DataFrame with pd.concat(). The older DataFrame.append() method was removed in Pandas 2.0. For example:
df = pd.concat([df, pd.DataFrame([{'A': 10, 'B': 11}])], ignore_index=True)  # add a new row by concatenating a one-row DataFrame
Deleting rows: You can delete a row from a DataFrame using the drop() method with the index label of the row and the axis=0 parameter.
In Pandas, you can set one or more columns as the index of a DataFrame using the set_index() method. By default, this method returns a new DataFrame with the specified column(s) as the index: pass a single column name to set one column, or a list of column names to build a multi-level index. You can also pass inplace=True to modify the DataFrame in place instead of creating a new one. To reset the index of a DataFrame to its default (0, 1, 2, …), use the reset_index() method, which moves the current index back into a regular column unless you pass drop=True.
Data operations in Pandas involve manipulating, transforming, and analyzing data in a variety of ways. Some common data operations in Pandas include:
Filtering: Selecting a subset of data based on specific criteria, such as values in one or more columns. This can be done using boolean indexing or the query() method.
Sorting: Arranging data in ascending or descending order based on one or more columns. This can be done using the sort_values() method.
Grouping: Splitting data into groups by one or more columns and applying a function to each group. This can be done using the groupby() method.
Aggregating: Calculating summary statistics for groups of data, such as mean, median, standard deviation, and count. This can be done using methods such as mean(), median(), std(), and count().
Merging and joining: Combining multiple DataFrames based on one or more common columns or on their indexes. This can be done using the merge() and join() methods.
Reshaping: Transforming data from one format to another, such as pivoting data from long to wide format or stacking data from wide to long format. This can be done using methods such as pivot(), stack(), and unstack().
Applying functions: Applying custom functions to one or more columns or rows of data. This can be done using the apply() method.
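Several of the operations listed above can be sketched on one small table. The employee names, departments, salaries, and cities below are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Aliya", "Alex", "Jack", "Mira"],
    "dept": ["eng", "ops", "eng", "ops"],
    "salary": [120, 90, 100, 95],
})

# Filtering: boolean indexing and the equivalent query() call.
high = df[df["salary"] > 95]
high_q = df.query("salary > 95")

# Sorting by one column, highest salary first.
ranked = df.sort_values("salary", ascending=False)

# Grouping and aggregating.
mean_by_dept = df.groupby("dept")["salary"].mean()

# Merging with a second table on a shared column.
locations = pd.DataFrame({"dept": ["eng", "ops"], "city": ["Pune", "Delhi"]})
merged = df.merge(locations, on="dept", how="left")

# Applying a custom function to a column.
salary_doubled = df["salary"].apply(lambda x: x * 2)
```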
To create an empty DataFrame in Pandas, you can call the DataFrame() constructor with no arguments, or pass a list of column names through the columns parameter. Here are the examples:
Creating an empty DataFrame with no columns:
import pandas as pd
df = pd.DataFrame()
Creating an empty DataFrame with columns:
import pandas as pd
columns = ['col1', 'col2', 'col3']
df = pd.DataFrame(columns=columns)
In both cases, the resulting DataFrame will have no rows, but the second example will have three columns named 'col1', 'col2', and 'col3'. You can then add rows using pd.concat() or the loc[] indexer; the older append() method was removed in Pandas 2.0.
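As a rough sketch of filling an empty DataFrame afterwards, the example below adds one row with loc[] and another with pd.concat(). It uses pd.concat() rather than append() because DataFrame.append() is not available in Pandas 2.0 and later. The row values are arbitrary.

```python
import pandas as pd

# Start with an empty DataFrame that only has column names.
df = pd.DataFrame(columns=["col1", "col2", "col3"])

# Add a row by assigning to a new index label.
df.loc[0] = [1, 2, 3]

# Add another row by concatenating a one-row DataFrame.
new_row = pd.DataFrame([{"col1": 4, "col2": 5, "col3": 6}])
df = pd.concat([df, new_row], ignore_index=True)
```

For more than a handful of rows, it is usually faster to collect the rows in a list and build the DataFrame once at the end.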
In Pandas, a Series is a one-dimensional labelled array that can hold any data type, such as integers, floats, strings, and even Python objects. To create a Series, you can use the Series() constructor and pass a list or array of values as the argument. Here is an example of creating a Series:
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
To create a copy of a Series in Pandas, you can use the copy() method. By default, copy() creates a deep copy of the Series, which means that changes made to the copy will not affect the original Series. Here is an example of creating a copy of a Series:
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
s_copy = s.copy()
In this example, s_copy is a deep copy of s, and any changes made to s_copy will not affect s.
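To make the difference between a copy and a plain assignment concrete, here is a short sketch. The variable name alias is introduced only for this example.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# copy() defaults to deep=True, so the data is duplicated.
s_copy = s.copy()
s_copy.iloc[0] = 100      # does not touch s

# Plain assignment only creates a second name for the same object.
alias = s
alias.iloc[1] = -2        # also visible through s
```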
To convert a NumPy array into a DataFrame in Pandas, you can use the pd.DataFrame() constructor and pass the NumPy array as the argument. Here's an example:
import numpy as np
import pandas as pd
# create a 2D NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# convert the NumPy array to a DataFrame
df = pd.DataFrame(data)
# print the DataFrame
print(df)
This will output:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
By default, the columns of the DataFrame will be numbered starting from 0. If you want to specify custom column names, you can pass them as a list to the columns parameter of the pd.DataFrame() constructor, like this:
import numpy as np
import pandas as pd
# create a 2D NumPy array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# specify column names
columns = ['A', 'B', 'C']
# convert the NumPy array to a DataFrame with custom column names
df = pd.DataFrame(data, columns=columns)
# print the DataFrame
print(df)
This will output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
What do you know about pandas?
What are the important features of the Pandas Library?
Here are some of the most important features of the Pandas library:
What are the important data structures used in Pandas?
Can you explain Index in Pandas?
Can you explain Reindexing?
What do you know about data aggregation in Pandas?
What is categorical data in Pandas?
What are the different ways in which a DataFrame can be created in Pandas?
# From a list of dictionaries (each dictionary is one row)
import pandas as pd
data = [{'name': 'Aliya', 'age': 23, 'gender': 'F'},
        {'name': 'Alex', 'age': 33, 'gender': 'M'},
        {'name': 'Jack', 'age': 38, 'gender': 'M'}]
df = pd.DataFrame(data)
# From a dictionary of lists (each key is one column)
data = {'name': ['Aliya', 'Alex', 'Jack'],
        'age': [23, 33, 38],
        'gender': ['F', 'M', 'M']}
df = pd.DataFrame(data)
# From a NumPy array
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
# From a SQL query (my_table is a placeholder name; the bare word "table" is a reserved SQL keyword)
import sqlite3
conn = sqlite3.connect('data.db')
df = pd.read_sql('SELECT * FROM my_table', conn)
# From a CSV file
df = pd.read_csv('data.csv')
How would you add and delete columns and rows in Pandas?
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = [7, 8, 9]  # add a new column by assigning a list of values
df.insert(1, 'D', [10, 11, 12])  # add a new column at position 1 using the insert() method
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.drop('C', axis=1)  # delete the 'C' column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.loc[3] = [7, 8]  # add a new row with index label 3 using the loc[] indexer
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.drop(1, axis=0)  # delete the row with index label 1
How do you set and reset indexes in Pandas?
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index('A')  # set column 'A' as the index
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.set_index(['A', 'B'])  # set columns 'A' and 'B' as a multi-level index
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.set_index('A', inplace=True)  # set column 'A' as the index in place
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}).set_index('A')
df.reset_index(inplace=True)  # move 'A' back to a column and restore the default 0, 1, 2 index
What do you understand about Data Operations in Pandas?
How do you create an empty DataFrame in Pandas?
Can you explain Series in Pandas?
How do you create a copy of a Series in Pandas?
How will you convert a NumPy array into a DataFrame in Pandas?