import pandas as pd
9 Data handling with Pandas
In this chapter, we will go over some of the basics of importing, adjusting and exporting data in Python. For the adjusting part, we will rely on the Pandas package, which is a data analysis package. We start by explaining how to use Pandas data frames, a convenient way to store large datasets. Afterwards, we will explain how you can import data from a file into a data frame, and how to export it to another file.
The Pandas package pandas
is typically imported under the alias pd
.
9.1 Data frames
We will start with a small data set consisting of six persons and some personal information about these people. The data is given in the following dictionary. It contains the names, height, weight, age and dietary preference (i.e., dictionary keys) of everyone.
= {
dataset 'name' : ["Aiden", "Bella", "Carlos", "Dalia", "Elena", "Farhan"],
'height (cm)' : [185, 155, 190, 185, 160, 170],
'weight (kg)' : [80, 60, 100, 85, 62, 75],
'age (years)' : [23, 23, 23, 21, 19, 25],
'dietary preference' : ['Veggie','Veggie','None','None','Vegan','None']
}
print(dataset.keys())
dict_keys(['name', 'height (cm)', 'weight (kg)', 'age (years)', 'dietary preference'])
A dictionary is not a convenient datatype to perform data analysis on. Therefore, we load the data into a so-called dataframe using the DataFrame()
function from pandas
.
= pd.DataFrame(dataset)
data_frame
print(data_frame)
name height (cm) weight (kg) age (years) dietary preference
0 Aiden 185 80 23 Veggie
1 Bella 155 60 23 Veggie
2 Carlos 190 100 23 None
3 Dalia 185 85 21 None
4 Elena 160 62 19 Vegan
5 Farhan 170 75 25 None
As you can see here, the keys of the dictionary become the column names of the data frame, and the values are stored in the corresponding column. You can also see the index of the row at the far left. Because we have a relatively small data set, the complete data frame is printed. If the data contains a large number of rows, typically the first and last five rows are printed in the console of Spyder. A data frame is an object of the type DataFrame
with which you can do all kinds of things.
print(type(data_frame))
<class 'pandas.core.frame.DataFrame'>
9.1.1 Accessing
If you want to print the first or last k rows, you can use the functions frame_name.head(k)
and frame_name.tail(k)
, respectively, with frame_name the name of the data frame.
# Print first three rows of data_frame
print(data_frame.head(3))
name height (cm) weight (kg) age (years) dietary preference
0 Aiden 185 80 23 Veggie
1 Bella 155 60 23 Veggie
2 Carlos 190 100 23 None
# Print last two rows of data_frame
print(data_frame.tail(2))
name height (cm) weight (kg) age (years) dietary preference
4 Elena 160 62 19 Vegan
5 Farhan 170 75 25 None
It is also possible to access specific rows or elements from the data frame using indexing.
To extract row i, use frame_name.loc[i]
.
# Extract row 0 (i.e., info of Aiden)
= data_frame.loc[0]
x
print(x)
name Aiden
height (cm) 185
weight (kg) 80
age (years) 23
dietary preference Veggie
Name: 0, dtype: object
To extract a specific column, use frame_name.loc[:,'column_name']
.
# Extract age column (i.e., info of Aiden)
= data_frame.loc[:,'age (years)']
x
print(x)
0 23
1 23
2 23
3 21
4 19
5 25
Name: age (years), dtype: int64
An alternative that not uses the loc[]
function is frame_name[column_name]
.
# Extract age column (i.e., info of Aiden)
= data_frame['age (years)']
x
print(x)
0 23
1 23
2 23
3 21
4 19
5 25
Name: age (years), dtype: int64
To extract from row i the entry in column ‘column_name’ use frame_name.loc[i,column_name]
.
# Extract the name and height from the person on row 0
= data_frame.loc[0,'name']
x = data_frame.loc[0,'height (cm)']
y
print("The height of", x, "is", y, "cm.")
The height of Aiden is 185 cm.
We can also use slicing to return a specified range of rows. For example, rows i through j can be obtained using frame_name.loc[i:j]
.
# Extract first three rows
= data_frame.loc[1:3]
x
print(x)
name height (cm) weight (kg) age (years) dietary preference
1 Bella 155 60 23 Veggie
2 Carlos 190 100 23 None
3 Dalia 185 85 21 None
A notable different with slicing in lists is that if we want the first, second AND third element of a list then we use [1:4]
. This is because the last element of the specified range (4 in this case) is not included when using slicing in lists.
# Slicing in list
= [13,4,5,2,11]
y
# Print first, second and third element of y
print(y[1:4])
[4, 5, 2]
Using slicing, we can also access specific combinations of columns and rows. Suppose we are only interested in the name, height and age of the first, second and third person in the frame. Because the columns have names (i.e., are not numbers), we index them by a list containing the colunmn names that we are interested in.
# Extract block with rows 1-3 and columns name, height and age.
= data_frame.loc[1:3,['name','height (cm)', 'age (years)']]
x
print(x)
name height (cm) age (years)
1 Bella 155 23
2 Carlos 190 23
3 Dalia 185 21
It is also possible to return a subset of rows that do not form a consective block. You can do this with a Boolean vector indicating for every row whether you want it to be included or not. For example, if we want to return only rows 0, 1, 4 and 5, we can do the following:
= [True,True,False,False,True,True]
rows = data_frame.loc[rows]
x
print(x)
name height (cm) weight (kg) age (years) dietary preference
0 Aiden 185 80 23 Veggie
1 Bella 155 60 23 Veggie
4 Elena 160 62 19 Vegan
5 Farhan 170 75 25 None
Here rows
is a Boolean list containing entries True
and False
with an element being True
if and only if we want the row to be included (namely 0,1,4 and 5), and False
otherwise (namely 2 and 3). We can achieve the same result with data_frame.loc[[0,1,4,5]]
, i.e, by giving a list of the row entries that we are intersted in.
The Boolean list approach is convenient, because it can also be used to select rows that satisfy a specified criterion. For example, suppose that we want to only select the rows of persons whose dietary preference is ‘None’. This can be achieved as follows.
# Boolean vector no_pref indicating whether dietary preference is 'None'
= data_frame['dietary preference'] == 'None'
no_pref print(no_pref)
0 False
1 False
2 True
3 True
4 False
5 True
Name: dietary preference, dtype: bool
# Extract rows for which list no_pref has 'True' entry
= data_frame.loc[no_pref]
x
print(x)
name height (cm) weight (kg) age (years) dietary preference
2 Carlos 190 100 23 None
3 Dalia 185 85 21 None
5 Farhan 170 75 25 None
The expression data_frame['Dietary preference'] == 'None'
checks for every row in the dietary preference column data_frame['Dietary preference']
whether its entry is ‘None’. If so, it returns True, and otherwise False. We store these True/False values in the list no_pref
(short for having no dietary preference). We then use this Boolean list to extract the rows of the data frame consisting of the persons whose dietary preference is None
.
9.1.2 Editing
It is also possible to edit the data frame, both the data in the frame, as well as the column and row names. For example, it might be that we start with data that is not given in a dictionary, but rather in a matrix (which is a list of lists, where each of the inner lists forms a row of the matrix).
= [
data 2,4,-1,2],
[5,1,2,9],
[3,7,8,9]
[
]
= pd.DataFrame(data)
frame
print(frame)
0 1 2 3
0 2 4 -1 2
1 5 1 2 9
2 3 7 8 9
Note that in this case both the rows and columns have their index number as name, so 0,1 and 2 for the rows and 0, 1, 2 and 3 for the columns. The names of the rows are stored in frame_name.index
and the columns in frame_name.columns
.
# Row names are stored in frame.index
print(frame.index)
RangeIndex(start=0, stop=3, step=1)
# Rename rows
= ['Row0','Row1','Row2']
frame.index
# Access the row names
print(frame.index)
Index(['Row0', 'Row1', 'Row2'], dtype='object')
If you want to access specific column names, you can use indexing.
# Print name of first column
print(frame.index[1])
Row1
Here is the complete frame with row names adjusted.
# Print data frame
print(frame)
0 1 2 3
Row0 2 4 -1 2
Row1 5 1 2 9
Row2 3 7 8 9
Next, let us adjust the colum names.
= ['Col0','Col1','Col2','Col3']
frame.columns
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 2 9
Row2 3 7 8 9
It is also possible to alter the entries within the frame.
# Edit entry on row 1, column 2
'Row1','Col2'] = 10
frame.loc[
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 10 9
Row2 3 7 8 9
You can also edit a complete row (or column)
# Replace row 2
'Row2',:] = [-2,-2,-2,-2]
frame.loc[
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 10 9
Row2 -2 -2 -2 -2
# Replace column 2
'Col2'] = [-1,-1,-1]
frame.loc[:,
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 -1 9
Row2 -2 -2 -1 -2
It is also possible to edit the entries of an entire column by applying a function to it using apply()
. For example, suppose that we want to square all the numbers in the second column. We can do this as follows.
def f(x):
return x**2
# frame['Col1'].apply(f) does not overwrite the entries in Col1
# so we have to do this ourselves
'Col1'] = frame['Col1'].apply(f)
frame[
print(frame)
Col0 Col1 Col2 Col3
Row0 2 16 -1 2
Row1 5 1 -1 9
Row2 -2 4 -1 -2
9.1.3 Adding data
It is also possible to add entire new rows and columns.
= [
data 2,4,-1,2],
[5,1,2,9],
[3,7,8,9]
[
]
# Create frame out of data
= pd.DataFrame(data)
frame
# Name rows and columns
= ['Col0','Col1','Col2','Col3']
frame.columns = ['Row0','Row1','Row2']
frame.index
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 2 9
Row2 3 7 8 9
Next, we add a row to the data frame. The loc[]
command uses for this adds the row at the bottom of the current data frame.
# Add a row
'New row'] = [5,5,3,1]
frame.loc[
print(frame)
Col0 Col1 Col2 Col3
Row0 2 4 -1 2
Row1 5 1 2 9
Row2 3 7 8 9
New row 5 5 3 1
The same holds for adding a column, which is done as follows. Note that here we use [:,'New column']
and not ['New column']
, because the latter would add the new data as a row.
'New column'] = [1,1,1,1]
frame.loc[:,
print(frame)
Col0 Col1 Col2 Col3 New column
Row0 2 4 -1 2 1
Row1 5 1 2 9 1
Row2 3 7 8 9 1
New row 5 5 3 1 1
Adding a new column you can also do with the insert()
function. This allows you to specify at which position you want the column to be inserted. The insert()
function needs three arguments: a position i where the column should be inserted, the column name and the column data, so the syntax is something like instert(i,column_name,column_data)
.
# Insert column with name 'New column' at position 2.
2,'Inserted column', [10,10,10,10])
frame.insert(
print(frame)
Col0 Col1 Inserted column Col2 Col3 New column
Row0 2 4 10 -1 2 1
Row1 5 1 10 2 9 1
Row2 3 7 10 8 9 1
New row 5 5 10 3 1 1
Adding a row at a specific position is also possible, but this is more involved and omitted here.
9.2 Mathematical operations
It is also possible to obtain statistical information from numerical columns.
= [
data 2,4,-1,2,'hello'],
[5,1,2,9,'bye'],
[3,7,8,9,'hello'],
[3,5,8,9,'hi'],
[31,5,4,9,'hi'],
[3,7,8,5,'hello'],
[
]
# Create frame out of data
= pd.DataFrame(data)
frame
# Name rows and columns
= ['Col0','Col1','Col2','Col3','Col4']
frame.columns = ['Row0','Row1','Row2','Row3','Row4','Row5']
frame.index
print(frame)
Col0 Col1 Col2 Col3 Col4
Row0 2 4 -1 2 hello
Row1 5 1 2 9 bye
Row2 3 7 8 9 hello
Row3 3 5 8 9 hi
Row4 31 5 4 9 hi
Row5 3 7 8 5 hello
For example, we can compute the minimum, maximum and average value by using the function min()
, max()
and mean()
, respectively.
# Minimum of the first column
= frame.loc[:,'Col1'].min()
min_col1
print(min_col1)
1
# Maximum of the second column
= frame.loc[:,'Col2'].max()
max_col2
print(max_col2)
8
# Mean of the zeroth column
= frame.loc[:,'Col0'].mean()
mean_col0
print(mean_col0)
7.833333333333333
It is also possible to count occurences of a given word
(or number) using value_counts()[word]
. For example, suppose we want to count how often the word 'hello'
appears in the third column.
= frame.loc[:,'Col4'].value_counts()['hello']
count_hello
print(count_hello)
3
We can also to more advanced things like counting the total number of occurrences of every word in the fourth column. By having a quick look at the data, we see that there are three distinct greetings, 'hello'
, 'hi'
and 'bye'
, to be found in the fourth column. A quick way to obtain these greetings in a list is to use the unique()
function. This function returns a list with all the unique entries found in the specified column.
= frame.loc[:,'Col4'].unique()
greetings
print(greetings)
['hello' 'bye' 'hi']
Next, we can loop over the greetings in the list greetings
and apply the value_counts()
function to all of them. We store the results in a dictionary whose keys are the greetings in greetings
and whose values are the number of times every greeting appears in the fourth column.
# Create empty dictionary
= {}
occur_count
for i in greetings:
= frame.loc[:,'Col4'].value_counts()[i]
occur_count[i]
print(occur_count)
{'hello': 3, 'bye': 1, 'hi': 2}
It is usually insightful to visualize your data as well. In this case we can make a bar plot using the bar()
function of matplotlib.pyplot
. It takes as input two lists (or arrays), the first list being the labels that should appear under the bars, and the second list the heights of the bars.
import matplotlib.pyplot as plt
# Labels of the bars (keys of dictionary occur_count)
= occur_count.keys()
labels print(labels)
# Heights of the bars (values of dictionary occur_count)
= occur_count.values()
values print(values)
# Create figure
plt.figure()
# Create bar plot
plt.bar(labels,values)
#Create title for plot
'Greeting occurrences') plt.title(
dict_keys(['hello', 'bye', 'hi'])
dict_values([3, 1, 2])
Text(0.5, 1.0, 'Greeting occurrences')
9.3 Importing and exporting data
Data is typically provided in an external file, for example, a comma-separated values (CSV) file. You can download the data that was given at the beginning of this section here in and you should store in under the name dataset.csv.
= pd.read_csv('dataset.csv')
csv_to_frame
print(csv_to_frame)
name height (cm) weight (kg) age (years) dietary preference
0 Aiden 185 80 23 Veggie
1 Bella 155 60 23 Veggie
2 Carlos 190 100 23 None
3 Dalia 185 85 21 None
4 Elena 160 62 19 Vegan
5 Farhan 170 75 25 None
6 Geert 178 80 25 Veggie
To test the code above in Spyder, you need to store the Python file that you execute the code in, in the same folder as the file dataset.csv.
The function read_csv
stores the data in the file dataset.csv. The first row of the data is assumed to contain the names of the columns. If the data file does not contain such a first row, we can import the data with the additional argument header=None
, meaning that we tell Python that there is no first row containing the column names.
= pd.read_csv('dataset.csv', header=None)
csv_to_frame
print(csv_to_frame)
0 1 2 3 4
0 name height (cm) weight (kg) age (years) dietary preference
1 Aiden 185 80 23 Veggie
2 Bella 155 60 23 Veggie
3 Carlos 190 100 23 None
4 Dalia 185 85 21 None
5 Elena 160 62 19 Vegan
6 Farhan 170 75 25 None
7 Geert 178 80 25 Veggie
Note that in the code above, the first row of the data is now included in the data frame, instead of set to be the names of the columns. The columns are now indexed by integers like the rows of the frame.
It is also possible to export an (adjusted) data frame to a comma-separated file. Let us first add another row to the existing frame and then export it to a new file called new_dataset.csv.
= pd.read_csv('dataset.csv')
frame
# Highest index in original frame is 5, so 6 is the index
# at which we place the new row
6] = ['Geert',178, 80, 25, 'Veggie']
frame.loc[
print(frame)
name height (cm) weight (kg) age (years) dietary preference
0 Aiden 185 80 23 Veggie
1 Bella 155 60 23 Veggie
2 Carlos 190 100 23 None
3 Dalia 185 85 21 None
4 Elena 160 62 19 Vegan
5 Farhan 170 75 25 None
6 Geert 178 80 25 Veggie
Now that we have added a new row, we can use the to_csv()
function tho store the frame in the new comma-separated file.
'new_dataset.csv') frame.to_csv(
The folder in which you have stored the original dataset.csv file, as well as the Python file in which the code is executed, should now contain a new file called new_dataset.csv. If you open the file in, e.g., Notepad (Windows) or Excel, you will see something like the following figure.
On the first line the column names can be found, and on the following lines the data from the frame. However, Python also exported the row indices 0,1,\dots,6. If you don’t want these indices to be included (they were also not contained in the original .csv file), you can use the argument index=False
in to_csv()
.
'new_dataset_no_indices.csv', index=False) frame.to_csv(
This time the resulting file does not have the row indices at the beginning of every line.
Instead of storing the data in a new file, we can also overwrite the original dataset.csv file.
'dataset.csv',index=False) frame.to_csv(
You should, however, be careful with overwriting files in this way. Always make sure you have a copy of the data stored somewhere else, in case something goes wrong!