Tinashe Michael Tapera - Managing Expected Loop Failures with Purrr

library(dplyr)
library(purrr)

set.seed(12345)

Looping is a fundamental programming paradigm. You have a set of inputs, and you wanna run a function on each of them:

inputs <- sample(1:100, 10)

add_ten <- function(x) {
  return(x + 10)
}

# base R looping
outputs <- c()
for(x in inputs){
  outputs <- c(outputs, (add_ten(x)))
}
# \loop

print(glue::glue("{inputs} -> {outputs}"))

14 -> 24
51 -> 61
80 -> 90
90 -> 100
92 -> 102
24 -> 34
58 -> 68
93 -> 103
75 -> 85
88 -> 98

With the purrr library, we get the same functionality as looping¹ but with an arguably friendlier interface and more compliant mechanics with the idiosyncracies of the tidyverse:

# much nicer!
outputs <- map(inputs, add_ten)
print(glue::glue("{inputs} -> {outputs}"))

14 -> 24
51 -> 61
80 -> 90
90 -> 100
92 -> 102
24 -> 34
58 -> 68
93 -> 103
75 -> 85
88 -> 98

This is all fine and dandy, but let’s say you get a failure from the data, like, add_ten throws an error if the output is greater than 100:

add_ten <- function(x) {
  output <- x + 10
  if(output > 100){
    stop("The output is too great!")
  }
  return(output)
}

In a for loop, this fails as expected:

outputs <- c()
for(x in inputs){
  outputs <- c(outputs, (add_ten(x)))
}

Error in add_ten(x): The output is too great!

If I had to debug it this code, I would probably set up an iterator:

outputs <- c()
for(x in 1:length(inputs)){
  print(x)
  outputs <- c(outputs, (add_ten(inputs[x])))
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

Error in add_ten(inputs[x]): The output is too great!

It failed at 5, so I’ll check inputs[5] and debug:

inputs[5]

[1] 92

# the output would be greater than 100! Duh!!

But with purrr::map(), there isn’t a straightforward way to debug like this². And if you have a large dataset, with a long-running function, you probably don’t want to wait until the map call fails and you have to go digging around into exactly which object in the vector had the problem.

Enter: safely() and possibly().

These are two functions that modify the behaviour of a purrr call. You can wrap your function in one of these, and purrr will give you a couple of ways of managing what happens if and when your loop fails or throws some kind of warning or unexpected output. Here’s an example with the add_ten function, using quietly() to force map to keep going even if there’s a failure:

add_ten <- function(x) {
  output <- x + 10
  if(output > 100){
    stop("The output is too great!")
  }
  return(output)
}

add_ten_safely <- safely(add_ten)

out <- add_ten_safely(10)
out

$result
[1] 20

$error
NULL

out <- add_ten_safely(95)
out

$result
NULL

$error
<simpleError in .f(...): The output is too great!>

Here, we see that safely() returns a list of outputs, with result and error. Implementing this in our dplyr chain would thus look like:

outputs <- map(inputs, add_ten_safely)
outputs

[[1]]
[[1]]$result
[1] 24

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 61

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 90

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 100

[[4]]$error
NULL


[[5]]
[[5]]$result
NULL

[[5]]$error
<simpleError in .f(...): The output is too great!>


[[6]]
[[6]]$result
[1] 34

[[6]]$error
NULL


[[7]]
[[7]]$result
[1] 68

[[7]]$error
NULL


[[8]]
[[8]]$result
NULL

[[8]]$error
<simpleError in .f(...): The output is too great!>


[[9]]
[[9]]$result
[1] 85

[[9]]$error
NULL


[[10]]
[[10]]$result
[1] 98

[[10]]$error
NULL

What if we want to have a default value returned if there is an error? Well, in base R we’d do something like this:

add_ten_w_error_base <- function(x) {
  output <- x + 10
  if(output > 100){
    # send a message to the console as a side effect
    message("The output is too great!") 
    # return a value
    return(NA)
  }
  return(output)
}

outputs <- map(inputs, add_ten_w_error_base)

The output is too great!
The output is too great!

outputs

[[1]]
[1] 24

[[2]]
[1] 61

[[3]]
[1] 90

[[4]]
[1] 100

[[5]]
[1] NA

[[6]]
[1] 34

[[7]]
[1] 68

[[8]]
[1] NA

[[9]]
[1] 85

[[10]]
[1] 98

But in purrr, safely() comes with the option to just specify this in the function with the otherwise argument! Check it out:

add_ten_safely <- safely(add_ten, otherwise = NA)

outputs <- map(inputs, add_ten_safely)
outputs

[[1]]
[[1]]$result
[1] 24

[[1]]$error
NULL


[[2]]
[[2]]$result
[1] 61

[[2]]$error
NULL


[[3]]
[[3]]$result
[1] 90

[[3]]$error
NULL


[[4]]
[[4]]$result
[1] 100

[[4]]$error
NULL


[[5]]
[[5]]$result
[1] NA

[[5]]$error
<simpleError in .f(...): The output is too great!>


[[6]]
[[6]]$result
[1] 34

[[6]]$error
NULL


[[7]]
[[7]]$result
[1] 68

[[7]]$error
NULL


[[8]]
[[8]]$result
[1] NA

[[8]]$error
<simpleError in .f(...): The output is too great!>


[[9]]
[[9]]$result
[1] 85

[[9]]$error
NULL


[[10]]
[[10]]$result
[1] 98

[[10]]$error
NULL

This is very useful! What’s more, the possibly() function defaults to only returning the successful result or the error condition, so you don’t even have to deal with a janky list output:

add_ten_possibly <- possibly(add_ten, otherwise = NA)

outputs <- map(inputs, add_ten_possibly)
outputs

[[1]]
[1] 24

[[2]]
[1] 61

[[3]]
[1] 90

[[4]]
[1] 100

[[5]]
[1] NA

[[6]]
[1] 34

[[7]]
[1] 68

[[8]]
[1] NA

[[9]]
[1] 85

[[10]]
[1] 98

Which is easily parseable:

unlist(outputs)

 [1]  24  61  90 100  NA  34  68  NA  85  98

Why Is This Useful

I’d say this is a useful family of functions in a limited handful of scenarios, but comes in clutch when you meet them. When I first tried these functions out, I was processing a number of input files (n < 1000) with an external Matlab function that read in the file, calculated a parameter, and sent it back to R. In my experience, this approach was great because I 1) a long-ish list of inputs to a function, 2) had a function that took around 5-10 minutes to run, per input, and 3) had an expected failure case that I didn’t much care about (parameter inputs were sometimes invalid) and predictable/not unexpected, so I didn’t quite want to handle them with a within-function tryCatch strategy.

In fact, most programmers (probably Python folks) are probably asking right now, “why would’t you just use a tryCatch and not deal with another dependency?”

Well, the answer is that I think with this method, we keep the functions much more compact and straightforward, while also acknowledging that I will get errors returned when I expect them. This would be an unsafe approach when I do not know what inputs are expected, and what exactly can go wrong. But on this particular afternoon at work, I knew pretty much every input dataset, and knew/didn’t care about the reasons for a failure of the processing. I felt that this scenario lended itself well to the prima facie, handwavy approach of using otherwise in what’s essentially an apply call with syntactic sugar.

So, the lesson here is, use purrr functions instead of your loops. Or don’t, I guess. I’m not the expert here. I was honestly just tired and needed a better solution than “check each of these files for the different errors they could throw”, and for that, purrr worked out perfectly.

Anyway, here’s a perfect loop to summarise this blog post. Any loop can be perfect, but when they are, they’re kinda freaky. Best to expect some failures.

Footnotes

In truth, purrr doesn’t implement a for or while loop; it’s actually a tidy implementation of the *apply family of functions. Awesome!↩︎
I’m lying of course; there is imap; but that’s not why we’re here today↩︎