library(dplyr)
library(purrr)
set.seed(12345)
Looping is a fundamental programming paradigm. You have a set of inputs, and you wanna run a function on each of them:
inputs <- sample(1:100, 10)
add_ten <- function(x) {
  return(x + 10)
}
# base R looping
outputs <- c()
for(x in inputs){
  outputs <- c(outputs, (add_ten(x)))
}# \loop
print(glue::glue("{inputs} -> {outputs}"))
14 -> 24
51 -> 61
80 -> 90
90 -> 100
92 -> 102
24 -> 34
58 -> 68
93 -> 103
75 -> 85
88 -> 98
With the purrr library, we get the same functionality as looping, but with an arguably friendlier interface and mechanics that fit better with the idiosyncrasies of the tidyverse:
# much nicer!
outputs <- map(inputs, add_ten)
print(glue::glue("{inputs} -> {outputs}"))
14 -> 24
51 -> 61
80 -> 90
90 -> 100
92 -> 102
24 -> 34
58 -> 68
93 -> 103
75 -> 85
88 -> 98
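One detail worth seeing in code: map() always returns a list, while the typed variants such as map_dbl() return a plain vector. The sketch below is only an illustration; the inputs are copied by hand from the printed output above so it does not depend on the random seed, and the names outputs_list and outputs_dbl are made up for this example.

```r
library(purrr)

# Inputs copied from the printed output above (assumption: same ten values)
inputs <- c(14, 51, 80, 90, 92, 24, 58, 93, 75, 88)

add_ten <- function(x) {
  return(x + 10)
}

# map() returns a list with one element per input
outputs_list <- map(inputs, add_ten)

# map_dbl() returns a numeric vector instead of a list
outputs_dbl <- map_dbl(inputs, add_ten)

is.list(outputs_list)
is.numeric(outputs_dbl)
```

If the function always returns a single number, the map_dbl() form saves an unlist() later on.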
This is all fine and dandy, but let’s say the data can trigger a failure. For example, add_ten throws an error if the output is greater than 100:
add_ten <- function(x) {
  output <- x + 10
  if(output > 100){
    stop("The output is too great!")
  }
  return(output)
}
In a for loop, this fails as expected:
outputs <- c()
for(x in inputs){
  outputs <- c(outputs, (add_ten(x)))
}
Error in add_ten(x): The output is too great!
If I had to debug this code, I would probably set up an iterator:
outputs <- c()
for(x in 1:length(inputs)){
  print(x)
  outputs <- c(outputs, (add_ten(inputs[x])))
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
Error in add_ten(inputs[x]): The output is too great!
It failed at 5, so I’ll check inputs[5] and debug:
inputs[5]
[1] 92
# the output would be greater than 100! Duh!!
But with purrr::map(), there isn’t a straightforward way to debug like this. And if you have a large dataset with a long-running function, you probably don’t want to wait until the map call fails and then go digging to find exactly which object in the vector caused the problem.
Enter: safely() and possibly().
These are two functions that modify the behaviour of the function you hand to a purrr call. You can wrap your function in one of these, and purrr will give you a couple of ways of managing what happens if and when an iteration of your loop hits an error. Here’s an example with the add_ten function, using safely() to force map to keep going even if there’s a failure:
add_ten <- function(x) {
  output <- x + 10
  if(output > 100){
    stop("The output is too great!")
  }
  return(output)
}
add_ten_safely <- safely(add_ten)
out <- add_ten_safely(10)
out
$result
[1] 20
$error
NULL
out <- add_ten_safely(95)
out
$result
NULL
$error
<simpleError in .f(...): The output is too great!>
Here, we see that safely() returns a list with two elements, result and error. Implementing this in our map() call would thus look like:
outputs <- map(inputs, add_ten_safely)
outputs
[[1]]
[[1]]$result
[1] 24
[[1]]$error
NULL
[[2]]
[[2]]$result
[1] 61
[[2]]$error
NULL
[[3]]
[[3]]$result
[1] 90
[[3]]$error
NULL
[[4]]
[[4]]$result
[1] 100
[[4]]$error
NULL
[[5]]
[[5]]$result
NULL
[[5]]$error
<simpleError in .f(...): The output is too great!>
[[6]]
[[6]]$result
[1] 34
[[6]]$error
NULL
[[7]]
[[7]]$result
[1] 68
[[7]]$error
NULL
[[8]]
[[8]]$result
NULL
[[8]]$error
<simpleError in .f(...): The output is too great!>
[[9]]
[[9]]$result
[1] 85
[[9]]$error
NULL
[[10]]
[[10]]$result
[1] 98
[[10]]$error
NULL
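With the result/error pairs in hand, finding out which inputs failed takes only a few lines. The following is a minimal sketch, not the only way to do it; the inputs are copied by hand from the printed output above, and the names failed, failed_idx, failed_inputs, and failed_msgs are hypothetical.

```r
library(purrr)

# Inputs copied from the printed output above (assumption: same ten values)
inputs <- c(14, 51, 80, 90, 92, 24, 58, 93, 75, 88)

add_ten <- function(x) {
  output <- x + 10
  if (output > 100) {
    stop("The output is too great!")
  }
  return(output)
}

add_ten_safely <- safely(add_ten)
outputs <- map(inputs, add_ten_safely)

# TRUE wherever safely() captured an error instead of a result
failed <- map_lgl(outputs, function(res) !is.null(res$error))

# Positions and original inputs that failed
failed_idx <- which(failed)
failed_inputs <- inputs[failed]

# Error messages for the failed elements only
failed_msgs <- map_chr(outputs[failed], function(res) conditionMessage(res$error))

failed_idx
failed_inputs
failed_msgs
```

This recovers the same information as the print(x) iterator in the for loop, but after the whole map has finished rather than at the first failure.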
What if we want to have a default value returned if there is an error? Well, in base R we’d do something like this:
add_ten_w_error_base <- function(x) {
  output <- x + 10
  if(output > 100){
    # send a message to the console as a side effect
    message("The output is too great!")
    # return a value
    return(NA)
  }
  return(output)
}
outputs <- map(inputs, add_ten_w_error_base)
The output is too great!
The output is too great!
outputs
[[1]]
[1] 24
[[2]]
[1] 61
[[3]]
[1] 90
[[4]]
[1] 100
[[5]]
[1] NA
[[6]]
[1] 34
[[7]]
[1] 68
[[8]]
[1] NA
[[9]]
[1] 85
[[10]]
[1] 98
But in purrr, safely() comes with the option to just specify this in the function with the otherwise argument! Check it out:
add_ten_safely <- safely(add_ten, otherwise = NA)
outputs <- map(inputs, add_ten_safely)
outputs
[[1]]
[[1]]$result
[1] 24
[[1]]$error
NULL
[[2]]
[[2]]$result
[1] 61
[[2]]$error
NULL
[[3]]
[[3]]$result
[1] 90
[[3]]$error
NULL
[[4]]
[[4]]$result
[1] 100
[[4]]$error
NULL
[[5]]
[[5]]$result
[1] NA
[[5]]$error
<simpleError in .f(...): The output is too great!>
[[6]]
[[6]]$result
[1] 34
[[6]]$error
NULL
[[7]]
[[7]]$result
[1] 68
[[7]]$error
NULL
[[8]]
[[8]]$result
[1] NA
[[8]]$error
<simpleError in .f(...): The output is too great!>
[[9]]
[[9]]$result
[1] 85
[[9]]$error
NULL
[[10]]
[[10]]$result
[1] 98
[[10]]$error
NULL
This is very useful! What’s more, the possibly() function returns only the successful result or, on error, the otherwise value, so you don’t even have to deal with a janky list output:
add_ten_possibly <- possibly(add_ten, otherwise = NA)
outputs <- map(inputs, add_ten_possibly)
outputs
[[1]]
[1] 24
[[2]]
[1] 61
[[3]]
[1] 90
[[4]]
[1] 100
[[5]]
[1] NA
[[6]]
[1] 34
[[7]]
[1] 68
[[8]]
[1] NA
[[9]]
[1] 85
[[10]]
[1] 98
Which is easily parseable:
unlist(outputs)
[1] 24 61 90 100 NA 34 68 NA 85 98
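If the goal is a plain numeric vector anyway, possibly() can be combined with map_dbl() to skip the unlist() step. This is a sketch under the assumption that every successful call returns a single number; NA_real_ is used as the default so the fallback has the same type as the results, and the inputs are copied by hand from the printed output above.

```r
library(purrr)

# Inputs copied from the printed output above (assumption: same ten values)
inputs <- c(14, 51, 80, 90, 92, 24, 58, 93, 75, 88)

add_ten <- function(x) {
  output <- x + 10
  if (output > 100) {
    stop("The output is too great!")
  }
  return(output)
}

# Fall back to a numeric NA whenever add_ten() throws an error
add_ten_possibly <- possibly(add_ten, otherwise = NA_real_)

# map_dbl() collects the results straight into a numeric vector
outputs <- map_dbl(inputs, add_ten_possibly)
outputs

# Positions whose call fell back to the default
which(is.na(outputs))
```

The trade-off is that the error objects are discarded, so this only suits failures you do not need to inspect.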
Why Is This Useful
I’d say this is a useful family of functions in a limited handful of scenarios, but it comes in clutch when you meet them. When I first tried these functions out, I was processing a number of input files (n < 1000) with an external Matlab function that read in the file, calculated a parameter, and sent it back to R. In my experience, this approach was great because I 1) had a long-ish list of inputs to a function, 2) had a function that took around 5-10 minutes to run per input, and 3) had an expected, predictable failure case that I didn’t much care about (parameter inputs were sometimes invalid), so I didn’t quite want to handle it with a within-function tryCatch strategy.
In fact, most programmers (probably Python folks) are probably asking right now, “why wouldn’t you just use a tryCatch and not deal with another dependency?”
Well, the answer is that I think this method keeps the functions much more compact and straightforward, while also acknowledging that I will get errors returned when I expect them. This would be an unsafe approach when I do not know what inputs to expect or what exactly can go wrong. But on this particular afternoon at work, I knew pretty much every input dataset, and knew (or didn’t care about) the reasons for a processing failure. I felt that this scenario lent itself well to the prima facie, handwavy approach of using otherwise in what’s essentially an apply call with syntactic sugar.
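For comparison, here is roughly what the dependency-free tryCatch version could look like in base R. It is a sketch for illustration only; add_ten_trycatch is a hypothetical wrapper name, and the inputs are copied by hand from the printed output earlier in the post.

```r
# Inputs copied from the printed output earlier in the post (assumption)
inputs <- c(14, 51, 80, 90, 92, 24, 58, 93, 75, 88)

add_ten <- function(x) {
  output <- x + 10
  if (output > 100) {
    stop("The output is too great!")
  }
  return(output)
}

# Hypothetical wrapper: return a numeric NA if add_ten() throws an error
add_ten_trycatch <- function(x) {
  tryCatch(
    add_ten(x),
    error = function(e) NA_real_
  )
}

# vapply() plays the role of map_dbl() in base R
outputs <- vapply(inputs, add_ten_trycatch, numeric(1))
outputs
```

It gets the same result, at the cost of writing and naming the wrapper yourself for every function you want to protect.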
So, the lesson here is, use purrr functions instead of your loops. Or don’t, I guess. I’m not the expert here. I was honestly just tired and needed a better solution than “check each of these files for the different errors they could throw”, and for that, purrr worked out perfectly.
Anyway, here’s a perfect loop to summarise this blog post. Any loop can be perfect, but when they are, they’re kinda freaky. Best to expect some failures.
[Embedded Reddit post: “Missed a spot… [L]” by u/igneus in r/perfectloops]