Module 4d: Saving and loading a dataframe

Module 4d: Saving and loading a dataframe#

In this module you will learn how:

  • to save and load a dataframe.

import pandas as pd

# load the dataframe from the previous section
transposed_dict_df = pd.read_csv("transposed_dict_df-4c.csv")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pandas as pd
      3 # load the dataframe from the previous section
      4 transposed_dict_df = pd.read_csv("transposed_dict_df-4c.csv")

ModuleNotFoundError: No module named 'pandas'

After spending a while on preparing your dataframe, it would be worthwhile to save it for later reuse. One of the common file formats for pandas dataframes are comma-separated csv files or tab-separated files. For example, tab-based files are frequently used in bioinformatics analysis of sequencing data, often kept in tab-delimited BED format.

You can save your tab-separated tsv file with transposed_dict_df.to_csv(). The method looks confusing with the csv-oriented name, but it works perfectly fine with tab files, because you can specify the delimiter with sep keyword. Tab spacing is indicated with \t.

# remove indexing, so it's easier to read the file later
# header is removed to highlight one data loading issue (discussed later)
transposed_dict_df.to_csv("example_dataframe.tsv", header=None, index=False, sep="\t") 

Now let’s try loading our tab-seperated file into a pandas dataframe:

pd.read_csv("example_dataframe.tsv", sep="\t")
0 Unnamed: 1 Unnamed: 2 Unnamed: 3 70.9 False ('Asthma', 'Diabetes')
0 1 male 40.0 1.88 50.0 True ()
1 2 male 18.0 1.65 73.0 True ('Lung cancer',)
2 3 female 83.0 1.72 87.0 False ('Cardio vascular disease', 'Alzheimers')
3 4 female 55.0 1.68 50.0 False ('Asthma', 'Anxiety')
4 5 NaN NaN NaN 64.0 True ('Diabetes',)
5 6 male 21.0 1.90 92.9 False ('Asthma', 'Colon cancer')
6 7 NaN NaN NaN 75.4 False ()
7 8 female 32.0 1.66 90.7 True ('Depression',)
8 9 male 67.0 1.78 82.3 False ('Anxiety', 'Diabetes', 'Cardio vascular disea...
9 participant11 female 34.0 1.55 64.0 False ()

We’ve loaded the file - but look at the very first row! It is treated as column names automatically. When loading datasets, you might frequently encounter dataframes that do not have header column names prespecified, so it is good to specify header = None when loading a file like this.

tab_file = pd.read_csv("example_dataframe.tsv", delimiter="\t", header=None)
tab_file
0 1 2 3 4 5 6
0 0 NaN NaN NaN 70.9 False ('Asthma', 'Diabetes')
1 1 male 40.0 1.88 50.0 True ()
2 2 male 18.0 1.65 73.0 True ('Lung cancer',)
3 3 female 83.0 1.72 87.0 False ('Cardio vascular disease', 'Alzheimers')
4 4 female 55.0 1.68 50.0 False ('Asthma', 'Anxiety')
5 5 NaN NaN NaN 64.0 True ('Diabetes',)
6 6 male 21.0 1.90 92.9 False ('Asthma', 'Colon cancer')
7 7 NaN NaN NaN 75.4 False ()
8 8 female 32.0 1.66 90.7 True ('Depression',)
9 9 male 67.0 1.78 82.3 False ('Anxiety', 'Diabetes', 'Cardio vascular disea...
10 participant11 female 34.0 1.55 64.0 False ()