Module 4b: Selecting rows and columns in pandas#

In this section you will learn how:

  • to select specific rows and columns in a dataframe,

  • to fill a dataframe with values.

Selecting rows and columns in a pandas dataframe follows different rules from accessing elements in NumPy arrays or Python lists. You can do that either by specifying their labels or indices.

import pandas as pd
import numpy as np
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pandas as pd
      2 import numpy as np

ModuleNotFoundError: No module named 'pandas'
# load the dataframe from the previous section, method is further explained in module 4d
transposed_dict_df = pd.read_csv("transposed_dict_df-4a.csv")

Selection by labels - columns#

This is one of the main ways to select a single column in a Pandas dataframe by its name: referring to the label as a string character in brackets:

# select a single column by name 
transposed_dict_df["height"]
0    175.00
1      1.79
2      1.75
3      1.63
4      1.68
5      1.59
6      1.90
7      1.71
8      1.66
9      1.78
Name: height, dtype: float64

Now if you want to select multiple columns, you need to put their names in a Python list, that then goes into the dataframe square brackets:

# select multiple columns by name
transposed_dict_df[["height", "weight"]]
height weight
0 175.00 70.9
1 1.79 103.4
2 1.75 85.1
3 1.63 55.9
4 1.68 50.0
5 1.59 64.0
6 1.90 92.9
7 1.71 75.4
8 1.66 90.7
9 1.78 82.3

Mini-exercise 1

Give names to the rows in transposed_dict_df as they used to be - i.e. row 0 gets label “participant1” etc. Don’t type participant labels manually, automate it.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

participant_labels = []
n_rows = transposed_dict_df.shape[0]
for i in range(0, n_rows):
    label = "participant"+str(i+1)
    participant_labels.append(label)
transposed_dict_df.index = participant_labels

Selection by labels - rows#

Let us reuse the dataframe that you updated in the exercise above. To select a row by its label, you need to use .loc:

# selecting a single row
print("Single row selection")
print(transposed_dict_df.loc["participant1"])
print("------")
print("Multiple row selection")
# selecting multiple rows - you need to put the labels in a list 
# like in multiple column selection previously
transposed_dict_df.loc[["participant1", "participant8"]]
Single row selection
gender                      female
age                             25
height                       175.0
weight                        70.9
smoking                      False
diseases    ('Asthma', 'Diabetes')
Name: participant1, dtype: object
------
Multiple row selection
gender age height weight smoking diseases
participant1 female 25 175.00 70.9 False ('Asthma', 'Diabetes')
participant8 male 46 1.71 75.4 False ()

Selecting both rows and columns by label#

If you want to select rows and columns by label together, you need to specify rows with .loc first and then in the next brackets list desired columns.

# single row and single column
print(transposed_dict_df.loc["participant8"]["weight"])
75.4

Mini-exercise 2

Select smoking and diseases data for participants 5 and 9 in one line using labels.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

print(transposed_dict_df.loc[["participant5", "participant9"]][["smoking", "diseases"]])

Selecting values by their integer position#

If you don’t want to use labels, row and column selection can be done with indices - you need to use .iloc[.., ..] for this. For example, to access diseases (column 5) present in participant 3 (row 2), you need to type transposed_dict_df.iloc[2, 5]. Pandas dataframes also use 0-based indexing, slicing rules are very similar to NumPy arrays.

# selecting a single value
print("Single value")
print(transposed_dict_df.iloc[2, 5])
print("----")
print("Range")
# select a range (the end position 7, 5 are not included in the selected range)
print(transposed_dict_df.iloc[2:7, 3:5])
print("----")
print("Specific values")
# selecting multiple specific values - provide indexes in a list
print(transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]])
print("----")
print("Entire row")
# select the entire row
print(transposed_dict_df.iloc[2, :])
print("----")
print("Entire column")
# select the entire column
print(transposed_dict_df.iloc[:, 0])
Single value
('Lung cancer',)
----
Range
              weight  smoking
participant3    85.1     True
participant4    55.9    False
participant5    50.0    False
participant6    64.0     True
participant7    92.9    False
----
Specific values
              gender  age  height
participant1  female   25  175.00
participant6  female   32    1.59
participant8    male   46    1.71
----
Entire row
gender                  male
age                       18
height                  1.75
weight                  85.1
smoking                 True
diseases    ('Lung cancer',)
Name: participant3, dtype: object
----
Entire column
participant1     female
participant2       male
participant3       male
participant4     female
participant5     female
participant6     female
participant7       male
participant8       male
participant9     female
participant10      male
Name: gender, dtype: object

Filling out values in a datafame#

Sometimes you might want to change values in a dataframe. Here are some examples for how to do it for a single value or larger data structures:

transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]] = np.nan # insert a null value

transposed_dict_df.iloc[[0, 5, 7], [0, 1, 2]]
gender age height
participant1 NaN NaN NaN
participant6 NaN NaN NaN
participant8 NaN NaN NaN
# can fill out the range of values with a numpy array, Python lists or a Pandas dataframe

revised_weight_height = np.array([[1.88, 1.65, 1.72], [50, 73, 87]])

transposed_dict_df.iloc[[1, 2, 3], [2, 3]] = revised_weight_height.T

transposed_dict_df.iloc[[1, 2, 3], [2, 3]]
height weight
participant2 1.88 50.0
participant3 1.65 73.0
participant4 1.72 87.0

Mini-exercise 3

Participant 1 has their height recorded in centimeters in transposed_dict_df. Change it to meters and update the dataframe.

# your code goes here
💡 Solution

One of the possible ways to achieve this result:

transposed_dict_df.loc["participant1"][["height"]] = 1.75

Save the dataframe to work on it in the next section. The method .to_csv will be explained in more detail in module 4d.

transposed_dict_df.to_csv("transposed_dict_df-4b.csv")