8 Data handling with Pandas

In this chapter, we will go over some of the basics of importing, adjusting and exporting data in Python. For the adjusting part, we will rely on the Pandas package, which is a data analysis package. We start by explaining how to use Pandas data frames, a convenient way to store large datasets. Afterwards, we will explain how you can import data from a file into a data frame, and how to export it to another file.

The Pandas package pandas is typically imported under the alias pd.

import pandas as pd

8.1 Data frames

We will start with a small data set consisting of six persons and some personal information about these people. The data is given in the following dictionary. It contains the names, height, weight, age and dietary preference (i.e., dictionary keys) of everyone.

dataset = {
 'name' : ["Aiden", "Bella", "Carlos", "Dalia", "Elena", "Farhan"],
 'height (cm)' : [185, 155, 190, 185, 160, 170],
 'weight (kg)' : [80, 60, 100, 85, 62, 75],
 'age (years)' : [23, 23, 23, 21, 19, 25],
 'dietary preference' : ['Veggie','Veggie','None','None','Vegan','None']
}

print(dataset.keys())

dict_keys(['name', 'height (cm)', 'weight (kg)', 'age (years)', 'dietary preference'])

A dictionary is not a convenient datatype to perform data analysis on. Therefore, we load the data into a so-called data frame using the DataFrame() function from pandas.

df = pd.DataFrame(dataset)

print(df)

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
2  Carlos          190          100           23               None
3   Dalia          185           85           21               None
4   Elena          160           62           19              Vegan
5  Farhan          170           75           25               None

As you can see here, the keys of the dictionary become the column names of the data frame, and the values are stored in the corresponding column. You can also see the index of the row at the far left. Here we call the data frame df (short for data frame), but you can name it whatever you like.

Because we have a relatively small data set, the complete data frame is printed. If the data contains a large number of rows, typically the first and last five rows are printed in the console of Spyder. A data frame is an object of the type DataFrame with which you can do all kinds of things.

print(type(df))

<class 'pandas.core.frame.DataFrame'>

8.1.1 Accessing

If you want to print the first or last k rows, you can use the functions df.head(k) and df.tail(k), respectively, with df being the name of your data frame.

# Print first three rows of df
print(df.head(3))

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
2  Carlos          190          100           23               None

# Print last two rows of df
print(df.tail(2))

     name  height (cm)  weight (kg)  age (years) dietary preference
4   Elena          160           62           19              Vegan
5  Farhan          170           75           25               None

It is also possible to access specific rows or elements from the data frame using indexing.

To extract row i, use df.loc[i].

# Extract row 0 (i.e., info of Aiden)
x = df.loc[0]

print(x)

name                   Aiden
height (cm)              185
weight (kg)               80
age (years)               23
dietary preference    Veggie
Name: 0, dtype: object

To extract a specific column, use df.loc[:,'column_name'].

# Extract age column
x = df.loc[:,'age (years)']

print(x)

0    23
1    23
2    23
3    21
4    19
5    25
Name: age (years), dtype: int64

An alternative that not uses the loc[] function is df[column_name].

# Extract age column
x = df['age (years)']

print(x)

0    23
1    23
2    23
3    21
4    19
5    25
Name: age (years), dtype: int64

If a column name doesn’t contain spaces or special characters (apart from ’_’) then it’s also possible to extract a column using df.column_name. Let’s try this with the name column:

# Extract name column
x = df.name

print(x)

0     Aiden
1     Bella
2    Carlos
3     Dalia
4     Elena
5    Farhan
Name: name, dtype: object

To extract the entry from row i in column ‘column_name’ we use df.loc[i,column_name].

# Extract the name and height from the person on row 0:
x = df.loc[0,'name']
y = df.loc[0,'height (cm)']

print("The height of", x, "is", y, "cm.")

The height of Aiden is 185 cm.

We can also use slicing to return a specified range of rows. For example, rows i through j can be obtained using df.loc[i:j].

# Extract first three rows
x = df.loc[0:2]

print(x)

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
2  Carlos          190          100           23               None

This works differently compared to slicing with lists. Recall there that if we want the first 3 elements of a list x then when we need to use x[0:3]. This is because the last element of the specified range (3 in this case) is not included when using slicing in lists:

# Slicing in list
y = [13,4,5,2,11]

# Print first, second and third element of y:
print(y[0:3])

[13, 4, 5]

Using slicing, we can also access specific combinations of columns and rows. Suppose we are only interested in the name, height and age of the first, second and third person in the frame. Because the columns have names (i.e., are not numbers), we index them by a list containing the column names that we are interested in.

# Extract block with the first 3 rows and columns name, height and age:
x = df.loc[0:2,['name','height (cm)', 'age (years)']]

print(x)

     name  height (cm)  age (years)
0   Aiden          185           23
1   Bella          155           23
2  Carlos          190           23

It is also possible to return a subset of rows that do not form a consecutive block. You can do this with a Boolean vector indicating for every row whether you want it to be included or not. For example, if we want to return only rows 0, 1, 4 and 5, we can do the following:

rows = [True,True,False,False,True,True]
x = df[rows]

print(x)

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
4   Elena          160           62           19              Vegan
5  Farhan          170           75           25               None

Here rows is a Boolean list containing entries True and False with an element being True if and only if we want the row to be included (namely 0,1,4 and 5), and False otherwise (namely 2 and 3). We can achieve the same result with df.loc[[0,1,4,5]], i.e, by giving a list of the row entries that we are interested in.

The Boolean list approach is convenient, because it can also be used to select rows that satisfy a specified criterion. For example, suppose that we want to only select the rows of persons whose dietary preference is ‘None’. This can be achieved as follows.

# Boolean vector no_pref indicating whether dietary preference is 'None'
no_pref = df['dietary preference'] == 'None'
print(no_pref)

0    False
1    False
2     True
3     True
4    False
5     True
Name: dietary preference, dtype: bool

# Extract rows for which list no_pref has 'True' entry
x = df[no_pref]

print(x)

     name  height (cm)  weight (kg)  age (years) dietary preference
2  Carlos          190          100           23               None
3   Dalia          185           85           21               None
5  Farhan          170           75           25               None

The expression df['Dietary preference'] == 'None' checks for every row in the dietary preference column df['Dietary preference'] whether its entry is 'None'. If so, it returns True, and otherwise False. We store these True/False values in the list no_pref (short for having no dietary preference). We then use this Boolean list to extract the rows of the data frame consisting of the persons whose dietary preference is None.

To extract all the rows where no_pref is False we can do:

x = df[~no_pref]

print(x)

    name  height (cm)  weight (kg)  age (years) dietary preference
0  Aiden          185           80           23             Veggie
1  Bella          155           60           23             Veggie
4  Elena          160           62           19              Vegan

For another example, let’s get all the rows for people who are at least 23 years old:

x = df[df['age (years)'] >= 23]

print(x)

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
2  Carlos          190          100           23               None
5  Farhan          170           75           25               None

You can also combine conditions with the logical & operator, provided each individual condition is in parentheses:

x = df[(df['age (years)'] >= 23) & (df['dietary preference'] == 'Veggie')]

print(x)

    name  height (cm)  weight (kg)  age (years) dietary preference
0  Aiden          185           80           23             Veggie
1  Bella          155           60           23             Veggie

To do an OR condition, we use |. To get all rows with people 23 and over OR with vegetarian preferences.

x = df[(df['age (years)'] >= 23) | (df['dietary preference'] == 'Veggie')]

print(x)

     name  height (cm)  weight (kg)  age (years) dietary preference
0   Aiden          185           80           23             Veggie
1   Bella          155           60           23             Veggie
2  Carlos          190          100           23               None
5  Farhan          170           75           25               None

8.1.2 Editing

It is also possible to edit the data frame, both the data in the frame, as well as the column and row names. For example, it might be that we start with data that is not given in a dictionary, but rather in a matrix (which is a list of lists, where each of the inner lists forms a row of the matrix).

data = [
[2,4,-1,2],
[5,1,2,9],
[3,7,8,9]
]

df = pd.DataFrame(data)

print(df)

   0  1  2  3
0  2  4 -1  2
1  5  1  2  9
2  3  7  8  9

Note that in this case both the rows and columns have their index number as name, so 0,1 and 2 for the rows and 0, 1, 2 and 3 for the columns. The names of the rows are stored in df.index and the columns in df.columns.

# Row names are stored in df.index
print(df.index)

RangeIndex(start=0, stop=3, step=1)

# Rename rows
df.index = ['Row0','Row1','Row2']

# Access the row names
print(df.index)

Index(['Row0', 'Row1', 'Row2'], dtype='object')

If you want to access specific column names, you can use indexing.

# Print name of first column
print(df.index[1])

Row1

Here is the complete frame with row names adjusted.

# Print data frame
print(df)

      0  1  2  3
Row0  2  4 -1  2
Row1  5  1  2  9
Row2  3  7  8  9

Next, let us adjust the column names.

df.columns = ['Col0','Col1','Col2','Col3']

print(df)

      Col0  Col1  Col2  Col3
Row0     2     4    -1     2
Row1     5     1     2     9
Row2     3     7     8     9

It is also possible to alter the entries within the frame.

# Edit entry on row 1, column 2
df.loc['Row1','Col2'] = 10

print(df)

      Col0  Col1  Col2  Col3
Row0     2     4    -1     2
Row1     5     1    10     9
Row2     3     7     8     9

You can also edit a complete row (or column)

# Replace row 2
df.loc['Row2',:] = [-2,-2,-2,-2]

print(df)

      Col0  Col1  Col2  Col3
Row0     2     4    -1     2
Row1     5     1    10     9
Row2    -2    -2    -2    -2

# Replace column 2
df.loc[:,'Col2'] = [-1,-1,-1]

print(df)

      Col0  Col1  Col2  Col3
Row0     2     4    -1     2
Row1     5     1    -1     9
Row2    -2    -2    -1    -2

It is also possible to edit the entries of an entire column by applying a function to it using apply(). For example, suppose that we want to square all the numbers in the second column. We can do this as follows.

def f(x):
    return x**2

# df['Col1'].apply(f) does not overwrite the entries in Col1
# so we have to do this ourselves
df['Col1'] = df['Col1'].apply(f)

print(df)

      Col0  Col1  Col2  Col3
Row0     2    16    -1     2
Row1     5     1    -1     9
Row2    -2     4    -1    -2

You can also define and apply simple functions in one go using what are called lambda functions. Let’s square all the entries in Col1 again but this time using a lambda function:

df['Col1'] = df['Col1'].apply(lambda x: x**2)
print(df)

      Col0  Col1  Col2  Col3
Row0     2   256    -1     2
Row1     5     1    -1     9
Row2    -2    16    -1    -2

8.1.3 Adding data

It is also possible to add entire new rows and columns.

data = [
[2,4,-1,2],
[5,1,2,9],
[3,7,8,9]
]

# Create frame out of data
df = pd.DataFrame(data)

# Name rows and columns
df.columns = ['Col0','Col1','Col2','Col3']
df.index = ['Row0','Row1','Row2']

print(df)

      Col0  Col1  Col2  Col3
Row0     2     4    -1     2
Row1     5     1     2     9
Row2     3     7     8     9

Next, we add a row to the data frame. The loc[] command uses for this adds the row at the bottom of the current data frame.

# Add a row
df.loc['New row'] = [5,5,3,1]

print(df)

         Col0  Col1  Col2  Col3
Row0        2     4    -1     2
Row1        5     1     2     9
Row2        3     7     8     9
New row     5     5     3     1

The same holds for adding a column, which is done as follows. Note that here we use [:,'New column'] and not ['New column'], because the latter would add the new data as a row.

df.loc[:,'New column'] = [1,1,1,1]

print(df)

         Col0  Col1  Col2  Col3  New column
Row0        2     4    -1     2           1
Row1        5     1     2     9           1
Row2        3     7     8     9           1
New row     5     5     3     1           1

It’s also possible to add a new column without using the loc[] command:

df['New column'] = [1,1,1,1]

print(df)

         Col0  Col1  Col2  Col3  New column
Row0        2     4    -1     2           1
Row1        5     1     2     9           1
Row2        3     7     8     9           1
New row     5     5     3     1           1

You can also add a new column with the insert() function. This allows you to specify at which position you want the column to be inserted. The insert() function needs three arguments: a position i where the column should be inserted, the column name and the column data, so the syntax is something like insert(i,column_name,column_data).

# Insert column with name 'New column' at position 2.
df.insert(2,'Inserted column', [10,10,10,10])

print(df)

         Col0  Col1  Inserted column  Col2  Col3  New column
Row0        2     4               10    -1     2           1
Row1        5     1               10     2     9           1
Row2        3     7               10     8     9           1
New row     5     5               10     3     1           1

Adding a row at a specific position is also possible, but this is more involved and omitted here.

8.2 Mathematical operations

It is also possible to obtain statistical information from numerical columns.

data = [
[2,4,-1,2,'hello'],
[5,1,2,9,'bye'],
[3,7,8,9,'hello'],
[3,5,8,9,'hi'],
[31,5,4,9,'hi'],
[3,7,8,5,'hello'],
]

# Create frame out of data
df = pd.DataFrame(data)

# Name rows and columns
df.columns = ['Col0','Col1','Col2','Col3','Col4']
df.index = ['Row0','Row1','Row2','Row3','Row4','Row5']

print(df)

      Col0  Col1  Col2  Col3   Col4
Row0     2     4    -1     2  hello
Row1     5     1     2     9    bye
Row2     3     7     8     9  hello
Row3     3     5     8     9     hi
Row4    31     5     4     9     hi
Row5     3     7     8     5  hello

For example, we can compute the minimum, maximum and average value by using the function min(), max() and mean(), respectively.

# Minimum of the first column
min_col1 = df['Col1'].min()

print(min_col1)

# Maximum of the second column
max_col2 = df['Col2'].max()

print(max_col2)

# Mean of the zeroth column
mean_col0 = df['Col0'].mean()

print(mean_col0)

7.833333333333333

It is also possible to count occurrences of a given word (or number) using value_counts()[word]. For example, suppose we want to count how often the word 'hello' appears in the third column:

count_hello = df['Col4'].value_counts()['hello']

print(count_hello)

We can also to more advanced things like counting the total number of occurrences of every word in the fourth column. By having a quick look at the data, we see that there are three distinct greetings, 'hello', 'hi' and 'bye', to be found in the fourth column. A quick way to obtain these greetings in a list is to use the unique() function. This function returns a list with all the unique entries found in the specified column.

greetings  = df['Col4'].unique()

print(greetings)

['hello' 'bye' 'hi']

Next, we can loop over the greetings in the list greetings and apply the value_counts() function to all of them. We store the results in a dictionary whose keys are the greetings in greetings and whose values are the number of times every greeting appears in the fourth column.

# Create empty dictionary
occur_count = {}

for i in greetings:
    occur_count[i] = df.loc[:,'Col4'].value_counts()[i]

print(occur_count)

{'hello': np.int64(3), 'bye': np.int64(1), 'hi': np.int64(2)}

8.3 Importing and exporting data

Data is typically provided in an external file, for example, a comma-separated values (CSV) file. You can download the data that was given at the beginning of this section here in and you should store in under the name dataset.csv.

df = pd.read_csv('dataset.csv')

print(df)

     name   height (cm)   weight (kg)   age (years)  dietary preference
0   Aiden           185            80            23              Veggie
1   Bella           155            60            23              Veggie
2  Carlos           190           100            23                None
3   Dalia           185            85            21                None
4   Elena           160            62            19               Vegan
5  Farhan           170            75            25                None
6   Geert           178            80            25              Veggie

To test the code above in Spyder, you need to store the Python file that you execute the code in, in the same folder as the file dataset.csv.

The function read_csv stores the data in the file dataset.csv. The first row of the data is assumed to contain the names of the columns. If the data file does not contain such a first row, we can import the data with the additional argument header=None, meaning that we tell Python that there is no first row containing the column names.

df = pd.read_csv('dataset.csv', header=None)

print(df)

        0             1             2             3                    4
0    name   height (cm)   weight (kg)   age (years)   dietary preference
1   Aiden           185            80            23               Veggie
2   Bella           155            60            23               Veggie
3  Carlos           190           100            23                 None
4   Dalia           185            85            21                 None
5   Elena           160            62            19                Vegan
6  Farhan           170            75            25                 None
7   Geert           178            80            25               Veggie

Note that in the code above, the first row of the data is now included in the data frame, instead of set to be the names of the columns. The columns are now indexed by integers like the rows of the frame.

It is also possible to export an (adjusted) data frame to a comma-separated file. Let us first add another row to the existing frame and then export it to a new file called new_dataset.csv.

df = pd.read_csv('dataset.csv')

# Highest index in original frame is 5, so 6 is the index 
# at which we place the new row
df.loc[6] = ['Geert',178, 80, 25, 'Veggie']

print(df)

     name   height (cm)   weight (kg)   age (years)  dietary preference
0   Aiden           185            80            23              Veggie
1   Bella           155            60            23              Veggie
2  Carlos           190           100            23                None
3   Dalia           185            85            21                None
4   Elena           160            62            19               Vegan
5  Farhan           170            75            25                None
6   Geert           178            80            25              Veggie

Now that we have added a new row, we can use the to_csv() function tho store the frame in the new comma-separated file.

df.to_csv('new_dataset.csv')

The folder in which you have stored the original dataset.csv file, as well as the Python file in which the code is executed, should now contain a new file called new_dataset.csv. If you open the file in, e.g., Notepad (Windows) or Excel, you will see something like the following figure.

On the first line the column names can be found, and on the following lines the data from the frame. However, Python also exported the row indices 0,1,\dots,6. If you don’t want these indices to be included (they were also not contained in the original .csv file), you can use the argument index=False in to_csv().

df.to_csv('new_dataset_no_indices.csv', index=False)

This time the resulting file does not have the row indices at the beginning of every line.

New data frame without row indices in .csv file

Instead of storing the data in a new file, we can also overwrite the original dataset.csv file.

df.to_csv('dataset.csv',index=False)

You should, however, be careful with overwriting files in this way. Always make sure you have a copy of the data stored somewhere else, in case something goes wrong!