How to organize your data for use in Python: Boxplot example with statistical calculations.

Download original file: 9_Organizing_your_data_and_boxplot.ipynb

View original file in nbviewer: 9_Organizing_your_data_and_boxplot.ipynb

Organizing your data

Because spreadsheets are very flexable, people use them in very different ways and often organize their data in a form that is easy for humans to understand, but non-trivial for computers to parse.

We will take two sheets from one spreadsheet (stats_data.xlsx) to illustrate this point (dirty_sheet and clean_sheet). The sheet we call clean_sheet is a reorganized form of the original data in dirty_sheet. If you open up this file in a spreadsheet program, you will note that the original user had re-organized this data to plot it, so the original form was not even a good choice for use by the spreadsheet program itself.

The general rule to organizing data is to keep data of the same nature together. Usually, this means data with the same units and measured in the same way.

import pandas 
fn = '../../data/stats_data.xlsx'
dirty_df = pandas.read_excel(fn, 'dirty_sheet')
clean_df = pandas.read_excel(fn, 'clean_sheet')


clean_df
exposure_time_h dose_ug mouse day mouse_weight_g lung_weight_g
0 0 0 0 22 22.15 0.78
1 0 0 1 25 27.76 0.70
2 0 0 2 22 24.00 0.97
3 0 0 3 24 22.74 1.32
4 0 0 4 22 23.02 1.08
5 0 0 5 25 23.68 1.27
6 0 0 6 25 20.22 0.74
7 0 0 7 22 22.17 1.25
8 6 200 0 25 18.46 1.12
9 6 200 1 25 19.16 0.88
10 6 200 2 25 19.69 1.04
11 6 200 3 25 18.95 0.97
12 6 200 4 25 26.83 0.85
13 6 200 5 25 26.24 0.81
14 6 200 6 25 30.65 0.86
15 6 200 7 24 22.07 1.14
16 24 200 0 25 25.54 1.19
17 24 200 2 25 19.04 0.15
18 24 200 3 25 20.65 1.15
19 24 200 4 25 20.42 0.99
20 24 200 5 24 20.07 1.26
21 24 200 6 24 25.04 1.29
22 24 200 7 25 24.80 1.07
23 72 200 0 25 18.51 1.31
24 72 200 1 24 19.14 1.15
25 72 200 2 24 16.97 1.02
26 72 200 3 23 19.78 1.57
27 72 200 4 25 16.97 1.42
28 72 200 5 25 17.74 1.31
29 72 200 6 25 18.61 1.33
30 72 200 7 24 18.31 1.30
31 72 400 0 25 27.29 0.57
32 72 400 1 25 30.03 1.09
33 72 400 2 25 27.47 0.35
34 72 400 3 25 23.62 0.80
35 72 400 4 25 23.28 0.38
36 72 400 5 25 28.21 0.91
37 72 400 6 25 29.40 0.25
38 72 400 7 25 30.37 0.18

After reading clean_sheet into clean_df, we can easily workd with this

data

First we add another column, lung_weight_fraction which is the fraction of the mouse weight that is lung (including tumors):

clean_df['lung_weight_fraction'] = clean_df['lung_weight_g'] / clean_df['mouse_weight_g']

Next, we orgnaize the data into separate trials, with data from each trial having the same dose and exposure time. We could have done this in the spreadsheet, but this illustrates how to group data using Python.

# Here we go through all possible combinations of 
# exposure time and dose and only add data that we 
# find.   
exposure_times = set(clean_df['exposure_time_h'])
dosages = set(clean_df['dose_ug'])

trial_dfs = {}
for et in exposure_times:
    for dose in dosages:

        # We use the query method to extract all the data from one trial into its
        # own DataFrame.
        query_string = 'exposure_time_h == {} and dose_ug == {}'.format(et, dose)
        print('Executing query: ', query_string)
        trial_df = clean_df.query(query_string)

        if trial_df.empty:
            print('  No data found for query: ', query_string)
            continue

        exp_key = '{}_{}'.format(et, dose)
        if exp_key == '0_0':
            exp_key = 'control'

        print('  Data found. Adding trial_df under the experiment_key: ', exp_key)

        trial_dfs[exp_key] = trial_df

Executing query:  exposure_time_h == 0 and dose_ug == 0
  Data found. Adding trial_df under the experiment_key:  control
Executing query:  exposure_time_h == 0 and dose_ug == 200
  No data found for query:  exposure_time_h == 0 and dose_ug == 200
Executing query:  exposure_time_h == 0 and dose_ug == 400
  No data found for query:  exposure_time_h == 0 and dose_ug == 400
Executing query:  exposure_time_h == 24 and dose_ug == 0
  No data found for query:  exposure_time_h == 24 and dose_ug == 0
Executing query:  exposure_time_h == 24 and dose_ug == 200
  Data found. Adding trial_df under the experiment_key:  24_200
Executing query:  exposure_time_h == 24 and dose_ug == 400
  No data found for query:  exposure_time_h == 24 and dose_ug == 400
Executing query:  exposure_time_h == 72 and dose_ug == 0
  No data found for query:  exposure_time_h == 72 and dose_ug == 0
Executing query:  exposure_time_h == 72 and dose_ug == 200
  Data found. Adding trial_df under the experiment_key:  72_200
Executing query:  exposure_time_h == 72 and dose_ug == 400
  Data found. Adding trial_df under the experiment_key:  72_400
Executing query:  exposure_time_h == 6 and dose_ug == 0
  No data found for query:  exposure_time_h == 6 and dose_ug == 0
Executing query:  exposure_time_h == 6 and dose_ug == 200
  Data found. Adding trial_df under the experiment_key:  6_200
Executing query:  exposure_time_h == 6 and dose_ug == 400
  No data found for query:  exposure_time_h == 6 and dose_ug == 400

Now we make a box plot with one box for each trial.

%matplotlib inline
import matplotlib.pylab as plt

fig, ax1 = plt.subplots(figsize=(12, 4))

# We sort the DataFrames by their median values.
ordered_trials = sorted((df['lung_weight_fraction'].median(), exp_key) 
                         for exp_key, df in trial_dfs.items())

# We only keep the experimental keys.
labels = [exp_key for _, exp_key in ordered_trials]

# The boxplot method requires a list of lists, with each inner list 
# being the data from one trial.
X = [trial_dfs[exp_key]['lung_weight_fraction'] for exp_key in labels]

# The bootstrap keyword specifies the number of iterations used to 
# bootstrap the 95% confidance intervals for each box.
ax1.boxplot(X, bootstrap=5000)
ax1.set_ylabel('Fraction of weight that is lung.', fontsize=15)
ax1.set_xlabel('time(h)_dose($\mu$g) or control')
ax1.set_title('Lung tumor trials sorted by the median value of the fractional weight of lung.')

xtick_names = plt.setp(ax1, xticklabels=labels)
plt.setp(xtick_names, rotation=30)

plt.show()

png

Finally, we can calculate the p-values for rejecting the null hypothesis that the mean values between these trials are different.

import scipy.stats as stats

# We don't need to assume that the variances are the same.
control_and_first = stats.ttest_ind(X[0], X[1], equal_var=False)
control_and_last = stats.ttest_ind(X[1], X[-1], equal_var=False)

print('The p-value between control and trial 72_400 is: ', control_and_first[1])
print('The p-value between control and trial 72_200 is: ', control_and_last[1])

The p-value between control and trial 72_400 is:  0.0015037647302
The p-value between control and trial 72_200 is:  0.000135155817627

Conclusion:

Both 72h samples seem to have mean values that are different from control, but in opposite directions.