Data Tip: Processing Checkbox-Factors – Tinashe Michael Tapera

The other day I was tasked with processing some data that had been generated from a survey that included checkboxes. You know, the “tick all that apply”, kind of questions. Nothing against this style of response, it can be quite effective for collecting unbiased data. However, platforms such as Qualtrics annoyingly make it such that checkbox responses are output as singular concatenated strings per response!

Let’s take a look at what this might look like, using the Star Wars dataset in the most recent version of dplyr:

dplyr::starwars

# A tibble: 87 × 14
   name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
   <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
 1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
 2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
 3 R2-D2           96    32 <NA>    white,… red        33   none  mascu… Naboo  
 4 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
 5 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
 6 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
 7 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
 8 R5-D4           97    32 <NA>    white,… red        NA   none  mascu… Tatooi…
 9 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
10 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
# … with 77 more rows, 4 more variables: species <chr>, films <list>,
#   vehicles <list>, starships <list>, and abbreviated variable names
#   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

sw = dplyr::starwars

The variable skin_color is a good candidate for us to learn how to process this type of response. In this variable, Obi-Wan’s hair colour is listed as “auburn, white”, which would be like checking two boxes in a survey. In order to get, for example, a count of how many people listed “white” as their hair colour, how would we deal with this column?

You could do it in base R by enumerating all of the different strings and checking if the row contains any:

# one moment I will give it up to python,
# R does not have any base objects for dictionaries
# or set-like collections

my_colours = list() 

for(character in 1:nrow(sw)){
  
  current_colours = sw$skin_color[character]
  current_colours = trimws(unlist(strsplit(current_colours,split = ",")))
  
  my_colours = append(my_colours, current_colours)
}

my_colours = unique(unlist(my_colours))
my_colours

 [1] "fair"          "gold"          "white"         "blue"         
 [5] "light"         "red"           "unknown"       "green"        
 [9] "green-tan"     "brown"         "pale"          "metal"        
[13] "dark"          "brown mottle"  "grey"          "mottled green"
[17] "orange"        "yellow"        "tan"           "silver"       
[21] "none"

Now we have all of the possible values. To see which rows contain one of the values, we have to create an empty dataframe and iterate over the original:

# create the dataframe
sw_colours = data.frame(matrix(nrow = 0, ncol = length(my_colours)))
names(sw_colours) = my_colours

# iterate over the original data frame counting hits for each
for(character in 1:nrow(sw)){
  
  hits = sapply(names(sw_colours), function(x) grepl(x, sw$skin_color[character]), USE.NAMES = FALSE)
  sw_colours[character,] = hits
}

And there you have it, join these two by name:

sw_colours$name = sw$name
sw_with_checkboxes = merge(sw, sw_colours, by = "name")
head(sw_with_checkboxes[,c("name", names(sw_colours))])

                 name  fair  gold white  blue light   red unknown green
1              Ackbar FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
2          Adi Gallia FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
3    Anakin Skywalker  TRUE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
4        Arvel Crynyd  TRUE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
5         Ayla Secura FALSE FALSE FALSE  TRUE FALSE FALSE   FALSE FALSE
6 Bail Prestor Organa FALSE FALSE FALSE FALSE FALSE FALSE   FALSE FALSE
  green-tan brown  pale metal  dark brown mottle  grey mottled green orange
1     FALSE  TRUE FALSE FALSE FALSE         TRUE FALSE         FALSE  FALSE
2     FALSE FALSE FALSE FALSE  TRUE        FALSE FALSE         FALSE  FALSE
3     FALSE FALSE FALSE FALSE FALSE        FALSE FALSE         FALSE  FALSE
4     FALSE FALSE FALSE FALSE FALSE        FALSE FALSE         FALSE  FALSE
5     FALSE FALSE FALSE FALSE FALSE        FALSE FALSE         FALSE  FALSE
6     FALSE FALSE FALSE FALSE FALSE        FALSE FALSE         FALSE  FALSE
  yellow   tan silver  none              name.1
1  FALSE FALSE  FALSE FALSE              Ackbar
2  FALSE FALSE  FALSE FALSE          Adi Gallia
3  FALSE FALSE  FALSE FALSE    Anakin Skywalker
4  FALSE FALSE  FALSE FALSE        Arvel Crynyd
5  FALSE FALSE  FALSE FALSE         Ayla Secura
6  FALSE  TRUE  FALSE FALSE Bail Prestor Organa

Admittedly, this is all quite long-winded and could even have been done better. But fortunately, someone has already tackled this problem, and the solution is available on CRAN: The splitstackshape package.

This package has handy function concat.split() and variants for our case:

sw_expanded=concat.split(sw, "skin_color", structure="expanded", type="character", fill = 0)
head(select(sw_expanded, name, matches("skin_color")))

            name  skin_color skin_color_blue skin_color_brown
1 Luke Skywalker        fair               0                0
2          C-3PO        gold               0                0
3          R2-D2 white, blue               1                0
4    Darth Vader       white               0                0
5    Leia Organa       light               0                0
6      Owen Lars       light               0                0
  skin_color_brown mottle skin_color_dark skin_color_fair skin_color_gold
1                       0               0               1               0
2                       0               0               0               1
3                       0               0               0               0
4                       0               0               0               0
5                       0               0               0               0
6                       0               0               0               0
  skin_color_green skin_color_green-tan skin_color_grey skin_color_light
1                0                    0               0                0
2                0                    0               0                0
3                0                    0               0                0
4                0                    0               0                0
5                0                    0               0                1
6                0                    0               0                1
  skin_color_metal skin_color_mottled green skin_color_none skin_color_orange
1                0                        0               0                 0
2                0                        0               0                 0
3                0                        0               0                 0
4                0                        0               0                 0
5                0                        0               0                 0
6                0                        0               0                 0
  skin_color_pale skin_color_red skin_color_silver skin_color_tan
1               0              0                 0              0
2               0              0                 0              0
3               0              0                 0              0
4               0              0                 0              0
5               0              0                 0              0
6               0              0                 0              0
  skin_color_unknown skin_color_white skin_color_yellow
1                  0                0                 0
2                  0                0                 0
3                  0                1                 0
4                  0                1                 0
5                  0                0                 0
6                  0                0                 0

Easy peasy!