Chapter 14 Reproducible Data Munging in R
library(datapasta)
library(tidyverse)
library(cowplot)
library(viridis)
Data munging is the process of taking data from one source(s) and working it into a condition where it can be analyzed. Every data set will differ and so every data munge will be custom.
Data munging is a bit like organic chemistry. We know what we want to create. We know the starting materials that we have on hand. We work work with reactions and intermediates necessary to produce that final product.
Here is one example of the munging process. The exercise converts an excel file data source into a format that can be used in R.
The data source is formatted in such a way that reading the file is buggy, and not pragmatic. We will use the import function (datapasta
) and a handful of functions in the tidyverse
package to get it into R and into a long table tidy format.
Note especially how every transaction with the data is recorded. Anyone who has this source data file can start with the same excel file and get an identical outcome. That’s reproducibility.
14.1 Jaxwest7 data
The Jaxwest7 data set is a Jackson Labs experiment conducted using an obese mouse model of diabetes type 2. The experiment tests whether a drug used to treat type2 diabetes, rosiglitazone, is effective in the model.
From the protocol, half the subjects receive the antidiabetic drug, the other receive vehicle as placebo. The syndrome is assessed by measuring two response variables: body weight and blood glucose concentrations. There are two explanatory variables: day of study and drug treatment.
The experimental design is therefore multivariate (weight, blood glucose) two-factor (drug treatment, day) ANOVA with repeated measures on the day variable.
Our ultimate goal is to create a dataset for MANOVA analysis.
The purpose of this chapter, for now, is to illustrate how to retrieve and process data to prepare it for analysis. We will use datapasta
to copy the data from the spreadsheet into code, which illustrates the Addin feature of RStudio.
14.1.1 Inspect the source data
Download the Jaxwest7.xls file from the mouse phenome database to your machine and open it with spreadsheet software such as Excel.
First go to the BloodGlucoseGroups tab.
This is readable file, but complex. In fact, this sheet illustrates what unstructured data looks like.
- Almost every column has cells containing multiple types of values.
- The first 8 rows have various descriptor text, including a logo.
- Rows 9-14 have some other definitions.
- Scroll way over to the right and some graphs pop up.
- The data we are interested in are in rows 15 to 42, and in columns F to S. Each of those columns has two column names, a date and a day. A variable column should have only one name.
Additionally…
Columns T and U have several missing values, because those animals were used for autopsy. We’re going to have to ignore their response values.
Cell F21 is a character value indicative of an out-of-range test result.
Columns 43 to 146 are missing entirely. Below the array are some summary statistics, each of which is a different parameter.
Finally, cage and mouse ID’s are not recorded.
Now go to the BodyWeightsGroups tab.
The structural issues are about the same as for the BloodGlucoseGroups sheet. Here there are cage ID’s but no mouse ID’s.
Most notably, there are only 7 rows of body weights in the rosiglitazone treatment group, whereas there are 8 rows in the corresponding glucose group.
These are not a spreadsheets that can be imported whole scale directly into R with ease. Instead, we need to grab only the data we need. Then we’ll use R to structure it for analysis.
14.2 Munge the glucose data into R
Let’s start with the glucose data R. The goal is to create a dataframe object with the following variables: animal id, day, treatment, and glucose value.
Glucose concentrations were measured twice per day on odd-numbered days plus day 12. Each column represents a blood draw session. This was done on each of 16 animals. Half were in a placebo group, half were in a drug group.
We’ll omit day 15 due to the NA values (those mice were harvested for autopsy, and so day 15 breaks the time series).
We’ll also omit the last row in the rosiglitazone group because it doesn’t have a corresponding match in the body weight data.
We assume each row represents a unique mouse and that the rows in the glucose and body weight data correspond. This is tenuous and not ideal.
14.2.0.1 Step 1
Deal with cell F21. It’s value in the excel spreadsheet is “Hi,” a character value rather than a numeric. We have two options: Assign it an NA value, or impute.
Since this is a related-measures time series with multiple other glucose measurements for that specific replicate, we’ll impute by using the average of all these other measurements.
Calculate the value that will be imputed:
#Use datapasta to paste in vector values. Calculate their mean. Then impute value for cell F21 in original data set by exchanging the value "Hi" with the mean produced here.
<- mean(c(449L, 525L, 419L, 437L, 476L, 525L, 499L, 516L, 485L, 472L, 535L, 500L, 497L)
F21 ); F21
## [1] 487.3077
Ideally, you’d import the data with the “Hi” value and fix it in R, to have a contiguous reproducible record for the imputation. In this case, I’m getting some anomalies using read_excel
from readr
, which seem related to the spreadsheet formatting. Leaving “Hi” in F21 causes some another issue with the datapasta
importer that require additional not-fun munging. So it will be fixed in situ.
14.2.0.2 Step 2
Fix the F21 cell in the spreadsheet file by imputing the value from above.
14.2.0.3 Step 3
Copy the spreadsheet array F15:S41 to the clipboard. This is 14 columns and 15 rows of glucose data. All values are numeric and represent the same variable: blood glucose concentration.
This omits the last replicate from the rosiglitazone glucose group. This is due to the fact that the body weight data only are for four rosiglitazone animals.
Use the datapasta
package Addin for this procedure. Create an object name, put the cursor next to it, and select Paste as Tribble
from the Addins
drop down menu. You’ll find the Addins
drop down just below the RStudio main menu.
<- tibble::tribble(
jw7gluc ~V1, ~V2, ~V3, ~V4, ~V5, ~V6, ~V7, ~V8, ~V9, ~V10, ~V11, ~V12, ~V13, ~V14,
136, 270, 162, 165, 192, 397, 172, 148, 291, 239, 192, 172, 235, 153,
345, 518, 429, 413, 456, 487, 468, 419, 507, 559, 420, 415, 511, 464,
190, 301, 311, 361, 398, 465, 388, 392, 453, 421, 355, 381, 394, 444,
434, 504, 453, 392, 350, 400, 458, 387, 342, 368, 355, 429, 373, 501,
424, 486, 447, 417, 496, 484, 468, 423, 472, 507, 458, 456, 519, 570,
170, 208, 134, 129, 147, 141, 241, 128, 162, 163, 222, 438, 307, 252,
487, 449, 525, 419, 437, 476, 525, 499, 516, 485, 472, 535, 500, 497,
218, 273, 254, 265, 338, 386, 287, 236, 347, 235, 432, 450, 509, 326,
179, 184, 124, 107, 108, 149, 142, 143, 112, 233, 113, 137, 106, 150,
260, 381, 174, 140, 132, 138, 164, 137, 122, 140, 102, 174, 120, 135,
115, 191, 132, 132, 169, 158, 129, 120, 122, 157, 94, 141, 120, 166,
526, 517, 465, 394, 310, 269, 213, 185, 145, 201, 131, 258, 114, 160,
325, 252, 203, 158, 135, 162, 164, 181, 150, 177, 162, 192, 170, 162,
329, 296, 212, 159, 156, 200, 139, 143, 164, 150, 119, 193, 148, 188,
230, 414, 408, 179, 432, 288, 163, 240, 185, 208, 138, 208, 153, 140
)
Notice how R coerces unique variable names for each column. That’s fine, but we’ll need to fix them.
14.2.0.4 Step 4
Convert from a wide to a long format, and check.
<- jw7gluc %>% pivot_longer(cols=V1:V14,
gluc names_to = "V",
values_to="glucose")
gluc
## # A tibble: 210 x 2
## V glucose
## <chr> <dbl>
## 1 V1 136
## 2 V2 270
## 3 V3 162
## 4 V4 165
## 5 V5 192
## 6 V6 397
## 7 V7 172
## 8 V8 148
## 9 V9 291
## 10 V10 239
## # ... with 200 more rows
14.2.0.5 Step 5
Create variables for ID, day, blood draw and treatment.
<- rep(LETTERS[1:15], each=14)
id <- rep(rep(c(1, 3, 5, 7, 9, 11, 12), each=2), 15)
day <- rep(rep(c("early", "late"), 7),15)
draw <- c(rep("placebo", 8*14 ), rep("rosiglitazone", 7*14)) treat
14.2.0.6 Step 6
Add the variables to the long data frame while removing the irrelevant “V” variable.
<- add_column(gluc,
gluc
id,
day,
draw,
treat,.before=T) %>%
select(-one_of("V"))
gluc
## # A tibble: 210 x 5
## id day draw treat glucose
## <chr> <dbl> <chr> <chr> <dbl>
## 1 A 1 early placebo 136
## 2 A 1 late placebo 270
## 3 A 3 early placebo 162
## 4 A 3 late placebo 165
## 5 A 5 early placebo 192
## 6 A 5 late placebo 397
## 7 A 7 early placebo 172
## 8 A 7 late placebo 148
## 9 A 9 early placebo 291
## 10 A 9 late placebo 239
## # ... with 200 more rows
14.2.0.7 Step 7
Convert every variable except for glucose to a factor.
<- c("id", "day", "draw", "treat")
cols <- lapply(gluc[cols], factor)
gluc[cols] gluc
## # A tibble: 210 x 5
## id day draw treat glucose
## <fct> <fct> <fct> <fct> <dbl>
## 1 A 1 early placebo 136
## 2 A 1 late placebo 270
## 3 A 3 early placebo 162
## 4 A 3 late placebo 165
## 5 A 5 early placebo 192
## 6 A 5 late placebo 397
## 7 A 7 early placebo 172
## 8 A 7 late placebo 148
## 9 A 9 early placebo 291
## 10 A 9 late placebo 239
## # ... with 200 more rows
14.2.0.8 Step 8
Average the early and late blood draws, to get one glucose value per day.
<- gluc %>%
gluc group_by(id, day, treat) %>%
summarise(glucose=mean(glucose))
## `summarise()` has grouped output by 'id', 'day'. You can override using the `.groups` argument.
gluc
## # A tibble: 105 x 4
## # Groups: id, day [105]
## id day treat glucose
## <fct> <fct> <fct> <dbl>
## 1 A 1 placebo 203
## 2 A 3 placebo 164.
## 3 A 5 placebo 294.
## 4 A 7 placebo 160
## 5 A 9 placebo 265
## 6 A 11 placebo 182
## 7 A 12 placebo 194
## 8 B 1 placebo 432.
## 9 B 3 placebo 421
## 10 B 5 placebo 472.
## # ... with 95 more rows
14.2.0.9 Step 9
Now here’s a summary of all the replicate values in tabular form.
%>%
gluc group_by(day, treat) %>%
summarise(
n=length(glucose),
mean=round(mean(glucose)),
sd=round(sd(glucose)),
sem=round(sd/sqrt(n)),
min=round(min(glucose)),
max=round(max(glucose))
)
## `summarise()` has grouped output by 'day'. You can override using the `.groups` argument.
## # A tibble: 14 x 8
## # Groups: day [7]
## day treat n mean sd sem min max
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 placebo 8 338 128 45 189 469
## 2 1 rosiglitazone 7 300 120 45 153 522
## 3 3 placebo 8 330 131 46 132 472
## 4 3 rosiglitazone 7 213 111 42 116 430
## 5 5 placebo 8 378 115 41 144 490
## 6 5 rosiglitazone 7 200 89 34 128 360
## 7 7 placebo 8 352 132 47 160 512
## 8 7 rosiglitazone 7 162 30 11 124 202
## 9 9 placebo 8 379 132 47 162 533
## 10 9 rosiglitazone 7 162 22 8 131 196
## 11 11 placebo 8 386 99 35 182 504
## 12 11 rosiglitazone 7 154 29 11 118 194
## 13 12 placebo 8 410 117 41 194 544
## 14 12 rosiglitazone 7 145 17 6 128 168
Woot! We’ll visualize at the end.
14.3 Munge the bodyweight data
The goal here is to create a data frame of the body weight data that is symmetric to the glucose dataframe created above. That’s because the ultimate goal is to join the two together into a single dataframe.
The major difference between the two sheets in the excel file is the body weights are measured daily rather than every other day as for glucose. We’ll therefore toss out some data. Sad.
14.3.0.1 Step 1
From the BodyWeightsGroups sheet we copy cells F15:T45 to the clipboard. Name an object in the code junk. Then on the Addins drop down menu select Paste as tribble
<- tibble::tribble(
jw7bw ~V1, ~V2, ~V3, ~V4, ~V5, ~V6, ~V7, ~V8, ~V9, ~V10, ~V11, ~V12, ~V13, ~V14, ~V15,
31.9, 32.1, 32.8, 33, 33.3, 33.2, 33.2, 32.8, 33.5, 34, 34.2, 34.9, 35.2, 35.8, 36.1,
38.1, 38.4, 38.7, 38.5, 38.8, 38.8, 38.9, 38.6, 39.4, 39.4, 39, 39.4, 39.7, 39.3, 39.1,
31, 31.7, 31.9, 32, 32.9, 33.1, 33.5, 33.6, 34.3, 34.4, 34.7, 35.5, 35.4, 35.5, 35.4,
36.4, 35.9, 36.8, 36.9, 37.3, 37.1, 37.4, 37.4, 36.8, 37.4, 37.1, 38, 37.4, 38, 37.9,
38.9, 38.5, 39.2, 39.1, 39.8, 39.3, 39.3, 39.5, 39.7, 40.1, 40.3, 41, 40.9, 41.4, 41.4,
33.8, 34.2, 34.2, 33.9, 34.7, 35, 35.3, 35.3, 35.8, 36.1, 36.7, 37.3, 37.5, 38.4, 38.7,
34.6, 35, 35, 35.1, 35.6, 35.5, 35.7, 36.1, 35.9, 35.9, 35.5, 35.3, 35.2, 35, 34.6,
33.4, 33.6, 34.2, 33.8, 34.3, 34.6, 34.7, 34.9, 34.9, 35.4, 35.6, 35.8, 36.1, 35.9, 35.8,
31.9, 31.9, 32.6, 33.2, 34.3, 34.7, 35.3, 35.1, 35.4, 35.6, 36, 36.9, 37, 38.7, 38.8,
33.4, 34.1, 35.2, 35.8, 36.7, 37.4, 38.2, 38.7, 39.5, 40.1, 40.2, 40.6, 41, 41.7, 42,
32.8, 33.7, 34.5, 34.9, 35.5, 36, 36, 36.2, 36, 36.9, 36.9, 37.6, 38.1, 39.2, 39.5,
37.9, 38.8, 39.9, 40.4, 41.5, 42.5, 43.3, 43.8, 44.5, 44.7, 45.1, 45.7, 46.3, 47.1, 47.1,
35.9, 37.2, 38, 38.6, 39.4, 40, 40.5, 41, 41.2, 42.3, 43.2, 43.9, 44.2, 44.3, 44.9,
35.2, 36.5, 37.4, 38, 39, 39.5, 40.4, 40.7, 40.9, 41.9, 42.5, 43.5, 44, 44.1, 44.7,
35.8, 36.6, 37.3, 37.7, 38.4, 38.4, 39.3, 40.1, 40.9, 41.4, 41.7, 42.3, 42.2, 42.9, 43.3
)
14.3.0.2 Step 2
Select only the days that correspond to the glucose measurement days.
<- jw7bw %>%
bw select(c(V1, V3, V5, V7, V9, V11, V12))
bw
## # A tibble: 15 x 7
## V1 V3 V5 V7 V9 V11 V12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31.9 32.8 33.3 33.2 33.5 34.2 34.9
## 2 38.1 38.7 38.8 38.9 39.4 39 39.4
## 3 31 31.9 32.9 33.5 34.3 34.7 35.5
## 4 36.4 36.8 37.3 37.4 36.8 37.1 38
## 5 38.9 39.2 39.8 39.3 39.7 40.3 41
## 6 33.8 34.2 34.7 35.3 35.8 36.7 37.3
## 7 34.6 35 35.6 35.7 35.9 35.5 35.3
## 8 33.4 34.2 34.3 34.7 34.9 35.6 35.8
## 9 31.9 32.6 34.3 35.3 35.4 36 36.9
## 10 33.4 35.2 36.7 38.2 39.5 40.2 40.6
## 11 32.8 34.5 35.5 36 36 36.9 37.6
## 12 37.9 39.9 41.5 43.3 44.5 45.1 45.7
## 13 35.9 38 39.4 40.5 41.2 43.2 43.9
## 14 35.2 37.4 39 40.4 40.9 42.5 43.5
## 15 35.8 37.3 38.4 39.3 40.9 41.7 42.3
14.3.0.3 Step 3
Change from wide to long format and check.
<- bw %>% pivot_longer(cols=V1:V12, names_to="V", values_to="weight")
bw bw
## # A tibble: 105 x 2
## V weight
## <chr> <dbl>
## 1 V1 31.9
## 2 V3 32.8
## 3 V5 33.3
## 4 V7 33.2
## 5 V9 33.5
## 6 V11 34.2
## 7 V12 34.9
## 8 V1 38.1
## 9 V3 38.7
## 10 V5 38.8
## # ... with 95 more rows
14.3.0.4 Step 4
Add variables for id, day and treatment.
<- rep(LETTERS[1:15], each=7)
id
<- rep(c(1, 3, 5, 7, 9, 11, 12), 15)
day
<- c(rep("placebo", 8*7), rep("rosiglitazone", 7*7))
treat
<- add_column(bw, id, day, treat, .before=T) %>% select(-one_of("V"))
bw bw
## # A tibble: 105 x 4
## id day treat weight
## <chr> <dbl> <chr> <dbl>
## 1 A 1 placebo 31.9
## 2 A 3 placebo 32.8
## 3 A 5 placebo 33.3
## 4 A 7 placebo 33.2
## 5 A 9 placebo 33.5
## 6 A 11 placebo 34.2
## 7 A 12 placebo 34.9
## 8 B 1 placebo 38.1
## 9 B 3 placebo 38.7
## 10 B 5 placebo 38.8
## # ... with 95 more rows
14.3.0.5 Step 5
Factorize each of the variables except for weight.
<- c("id", "day", "treat")
cols
<- lapply(bw[cols], factor)
bw[cols] bw
## # A tibble: 105 x 4
## id day treat weight
## <fct> <fct> <fct> <dbl>
## 1 A 1 placebo 31.9
## 2 A 3 placebo 32.8
## 3 A 5 placebo 33.3
## 4 A 7 placebo 33.2
## 5 A 9 placebo 33.5
## 6 A 11 placebo 34.2
## 7 A 12 placebo 34.9
## 8 B 1 placebo 38.1
## 9 B 3 placebo 38.7
## 10 B 5 placebo 38.8
## # ... with 95 more rows
14.3.0.6 Step 6
Calculate some descriptive statistics.
%>%
bw group_by(day, treat) %>%
summarise(
n=length(weight),
mean=mean(weight),
sd=sd(weight),
sem=sd/sqrt(n),
min=min(weight),
max=max(weight)
)
## `summarise()` has grouped output by 'day'. You can override using the `.groups` argument.
## # A tibble: 14 x 8
## # Groups: day [7]
## day treat n mean sd sem min max
## <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 placebo 8 34.8 2.83 1.00 31 38.9
## 2 1 rosiglitazone 7 34.7 2.09 0.791 31.9 37.9
## 3 3 placebo 8 35.4 2.65 0.938 31.9 39.2
## 4 3 rosiglitazone 7 36.4 2.45 0.927 32.6 39.9
## 5 5 placebo 8 35.8 2.55 0.900 32.9 39.8
## 6 5 rosiglitazone 7 37.8 2.48 0.936 34.3 41.5
## 7 7 placebo 8 36 2.32 0.820 33.2 39.3
## 8 7 rosiglitazone 7 39 2.77 1.05 35.3 43.3
## 9 9 placebo 8 36.3 2.26 0.798 33.5 39.7
## 10 9 rosiglitazone 7 39.8 3.17 1.20 35.4 44.5
## 11 11 placebo 8 36.6 2.11 0.747 34.2 40.3
## 12 11 rosiglitazone 7 40.8 3.33 1.26 36 45.1
## 13 12 placebo 8 37.2 2.19 0.775 34.9 41
## 14 12 rosiglitazone 7 41.5 3.30 1.25 36.9 45.7
14.4 Merge and explore the glucose and bw data
Joining the two data frames is so simple it is almost stupid easy.
<- left_join(bw, gluc) jw7
## Joining, by = c("id", "day", "treat")
jw7
## # A tibble: 105 x 5
## id day treat weight glucose
## <fct> <fct> <fct> <dbl> <dbl>
## 1 A 1 placebo 31.9 203
## 2 A 3 placebo 32.8 164.
## 3 A 5 placebo 33.3 294.
## 4 A 7 placebo 33.2 160
## 5 A 9 placebo 33.5 265
## 6 A 11 placebo 34.2 182
## 7 A 12 placebo 34.9 194
## 8 B 1 placebo 38.1 432.
## 9 B 3 placebo 38.7 421
## 10 B 5 placebo 38.8 472.
## # ... with 95 more rows
And now for the visualizations.
<- ggplot(jw7, aes(day, glucose, color=treat, group=id))+
p1 geom_point()+geom_line()+
theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_color_viridis_d(begin=0, end=0.8)
<- ggplot(jw7, aes(day, weight, color=treat, group=id))+
p2 geom_point()+geom_line()+
theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_color_viridis_d(begin=0, end=0.8)
plot_grid(p1,p2)
<- ggplot(jw7)+
p3 geom_histogram(aes(glucose, fill=treat))+
theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_fill_viridis_d(begin=0, end=0.8)
<- ggplot(jw7)+
p4 geom_histogram(aes(weight, fill=treat))+
theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_fill_viridis_d(begin=0, end=0.8)
plot_grid(p3,p4)
<- ggplot(jw7, aes(day, glucose,color=treat)) +
p5 stat_summary(fun.data = "mean_sdl",
fun.args = list(mult = 1),
geom ="pointrange") +
stat_summary(fun.y = mean,
geom = "line",
aes(group=treat)
+
) theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_color_viridis_d(begin=0, end=0.8)
## Warning: `fun.y` is deprecated. Use `fun` instead.
<- ggplot(jw7, aes(day, weight, color=treat)) +
p6 stat_summary(fun.data = "mean_sdl",
fun.args = list(mult = 1),
geom ="pointrange") +
stat_summary(fun.y = mean,
geom = "line",
aes(group=treat)
+
) theme_half_open(12) +
theme(plot.margin = margin(6, 0, 6, 0), legend.position="top")+
scale_color_viridis_d(begin=0, end=0.8)
## Warning: `fun.y` is deprecated. Use `fun` instead.
plot_grid(p5,p6)
## Warning: Computation failed in `stat_summary()`:
## Can't convert a double vector to function
## Warning: Computation failed in `stat_summary()`:
## Can't convert a double vector to function
14.5 Summary
It is important to have a final product in mind before starting a munge. In this case, the goal was to create a dataframe for multivariate statistical analysis. These analyses need for each dependent variable to share a common set of independent variables. That product is the jw7
object above.
* Many data sets are like the Jaxwest7, with plenty of good information but unstructured.
* datapasta
is your friend.
* Every munge is a custom munge.