How to organize your data for use in Python: Boxplot example with statistical calculations.
Download original file: 9_Organizing_your_data_and_boxplot.ipynb
View original file in nbviewer: 9_Organizing_your_data_and_boxplot.ipynb
Organizing your data
Because spreadsheets are very flexable, people use them in very different ways and often organize their data in a form that is easy for humans to understand, but non-trivial for computers to parse.
We will take two sheets from one spreadsheet (stats_data.xlsx
)
to illustrate this point (dirty_sheet
and clean_sheet
).
The sheet we call clean_sheet
is a reorganized form of the
original data in dirty_sheet
. If you open up this file in
a spreadsheet program, you will note that the original user
had re-organized this data to plot it, so the original form
was not even a good choice for use by the spreadsheet program itself.
The general rule to organizing data is to keep data of the same nature together. Usually, this means data with the same units and measured in the same way.
import pandas
fn = '../../data/stats_data.xlsx'
dirty_df = pandas.read_excel(fn, 'dirty_sheet')
clean_df = pandas.read_excel(fn, 'clean_sheet')
clean_df
exposure_time_h | dose_ug | mouse | day | mouse_weight_g | lung_weight_g | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 22 | 22.15 | 0.78 |
1 | 0 | 0 | 1 | 25 | 27.76 | 0.70 |
2 | 0 | 0 | 2 | 22 | 24.00 | 0.97 |
3 | 0 | 0 | 3 | 24 | 22.74 | 1.32 |
4 | 0 | 0 | 4 | 22 | 23.02 | 1.08 |
5 | 0 | 0 | 5 | 25 | 23.68 | 1.27 |
6 | 0 | 0 | 6 | 25 | 20.22 | 0.74 |
7 | 0 | 0 | 7 | 22 | 22.17 | 1.25 |
8 | 6 | 200 | 0 | 25 | 18.46 | 1.12 |
9 | 6 | 200 | 1 | 25 | 19.16 | 0.88 |
10 | 6 | 200 | 2 | 25 | 19.69 | 1.04 |
11 | 6 | 200 | 3 | 25 | 18.95 | 0.97 |
12 | 6 | 200 | 4 | 25 | 26.83 | 0.85 |
13 | 6 | 200 | 5 | 25 | 26.24 | 0.81 |
14 | 6 | 200 | 6 | 25 | 30.65 | 0.86 |
15 | 6 | 200 | 7 | 24 | 22.07 | 1.14 |
16 | 24 | 200 | 0 | 25 | 25.54 | 1.19 |
17 | 24 | 200 | 2 | 25 | 19.04 | 0.15 |
18 | 24 | 200 | 3 | 25 | 20.65 | 1.15 |
19 | 24 | 200 | 4 | 25 | 20.42 | 0.99 |
20 | 24 | 200 | 5 | 24 | 20.07 | 1.26 |
21 | 24 | 200 | 6 | 24 | 25.04 | 1.29 |
22 | 24 | 200 | 7 | 25 | 24.80 | 1.07 |
23 | 72 | 200 | 0 | 25 | 18.51 | 1.31 |
24 | 72 | 200 | 1 | 24 | 19.14 | 1.15 |
25 | 72 | 200 | 2 | 24 | 16.97 | 1.02 |
26 | 72 | 200 | 3 | 23 | 19.78 | 1.57 |
27 | 72 | 200 | 4 | 25 | 16.97 | 1.42 |
28 | 72 | 200 | 5 | 25 | 17.74 | 1.31 |
29 | 72 | 200 | 6 | 25 | 18.61 | 1.33 |
30 | 72 | 200 | 7 | 24 | 18.31 | 1.30 |
31 | 72 | 400 | 0 | 25 | 27.29 | 0.57 |
32 | 72 | 400 | 1 | 25 | 30.03 | 1.09 |
33 | 72 | 400 | 2 | 25 | 27.47 | 0.35 |
34 | 72 | 400 | 3 | 25 | 23.62 | 0.80 |
35 | 72 | 400 | 4 | 25 | 23.28 | 0.38 |
36 | 72 | 400 | 5 | 25 | 28.21 | 0.91 |
37 | 72 | 400 | 6 | 25 | 29.40 | 0.25 |
38 | 72 | 400 | 7 | 25 | 30.37 | 0.18 |
After reading clean_sheet
into clean_df
, we can easily workd with this
data
First we add another column, lung_weight_fraction
which is the fraction of the
mouse weight that is lung (including tumors):
clean_df['lung_weight_fraction'] = clean_df['lung_weight_g'] / clean_df['mouse_weight_g']
Next, we orgnaize the data into separate trials, with data from each trial
having the same dose and exposure time. We could have done this in the
spreadsheet, but this illustrates how to group data using Python
.
# Here we go through all possible combinations of
# exposure time and dose and only add data that we
# find.
exposure_times = set(clean_df['exposure_time_h'])
dosages = set(clean_df['dose_ug'])
trial_dfs = {}
for et in exposure_times:
for dose in dosages:
# We use the query method to extract all the data from one trial into its
# own DataFrame.
query_string = 'exposure_time_h == {} and dose_ug == {}'.format(et, dose)
print('Executing query: ', query_string)
trial_df = clean_df.query(query_string)
if trial_df.empty:
print(' No data found for query: ', query_string)
continue
exp_key = '{}_{}'.format(et, dose)
if exp_key == '0_0':
exp_key = 'control'
print(' Data found. Adding trial_df under the experiment_key: ', exp_key)
trial_dfs[exp_key] = trial_df
Executing query: exposure_time_h == 0 and dose_ug == 0
Data found. Adding trial_df under the experiment_key: control
Executing query: exposure_time_h == 0 and dose_ug == 200
No data found for query: exposure_time_h == 0 and dose_ug == 200
Executing query: exposure_time_h == 0 and dose_ug == 400
No data found for query: exposure_time_h == 0 and dose_ug == 400
Executing query: exposure_time_h == 24 and dose_ug == 0
No data found for query: exposure_time_h == 24 and dose_ug == 0
Executing query: exposure_time_h == 24 and dose_ug == 200
Data found. Adding trial_df under the experiment_key: 24_200
Executing query: exposure_time_h == 24 and dose_ug == 400
No data found for query: exposure_time_h == 24 and dose_ug == 400
Executing query: exposure_time_h == 72 and dose_ug == 0
No data found for query: exposure_time_h == 72 and dose_ug == 0
Executing query: exposure_time_h == 72 and dose_ug == 200
Data found. Adding trial_df under the experiment_key: 72_200
Executing query: exposure_time_h == 72 and dose_ug == 400
Data found. Adding trial_df under the experiment_key: 72_400
Executing query: exposure_time_h == 6 and dose_ug == 0
No data found for query: exposure_time_h == 6 and dose_ug == 0
Executing query: exposure_time_h == 6 and dose_ug == 200
Data found. Adding trial_df under the experiment_key: 6_200
Executing query: exposure_time_h == 6 and dose_ug == 400
No data found for query: exposure_time_h == 6 and dose_ug == 400
Now we make a box plot with one box for each trial.
%matplotlib inline
import matplotlib.pylab as plt
fig, ax1 = plt.subplots(figsize=(12, 4))
# We sort the DataFrames by their median values.
ordered_trials = sorted((df['lung_weight_fraction'].median(), exp_key)
for exp_key, df in trial_dfs.items())
# We only keep the experimental keys.
labels = [exp_key for _, exp_key in ordered_trials]
# The boxplot method requires a list of lists, with each inner list
# being the data from one trial.
X = [trial_dfs[exp_key]['lung_weight_fraction'] for exp_key in labels]
# The bootstrap keyword specifies the number of iterations used to
# bootstrap the 95% confidance intervals for each box.
ax1.boxplot(X, bootstrap=5000)
ax1.set_ylabel('Fraction of weight that is lung.', fontsize=15)
ax1.set_xlabel('time(h)_dose($\mu$g) or control')
ax1.set_title('Lung tumor trials sorted by the median value of the fractional weight of lung.')
xtick_names = plt.setp(ax1, xticklabels=labels)
plt.setp(xtick_names, rotation=30)
plt.show()
Finally, we can calculate the p-values for rejecting the null hypothesis that the mean values between these trials are different.
import scipy.stats as stats
# We don't need to assume that the variances are the same.
control_and_first = stats.ttest_ind(X[0], X[1], equal_var=False)
control_and_last = stats.ttest_ind(X[1], X[-1], equal_var=False)
print('The p-value between control and trial 72_400 is: ', control_and_first[1])
print('The p-value between control and trial 72_200 is: ', control_and_last[1])
The p-value between control and trial 72_400 is: 0.0015037647302
The p-value between control and trial 72_200 is: 0.000135155817627
Conclusion:
Both 72h samples seem to have mean values that are different from control, but in opposite directions.