Parallelize 'pbapply' functions


The futurize package allows you to easily turn sequential code into parallel code by piping the sequential code to the futurize() function. Easy!

TL;DR

library(futurize)
plan(multisession)
library(pbapply)

slow_fcn <- function(x) {
  message("x = ", x)
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

xs <- 1:100
ys <- pblapply(xs, slow_fcn) |> futurize()

Introduction

This vignette demonstrates how to use this approach to parallelize pbapply functions such as pblapply(), pbsapply(), and pbvapply().

The pbapply package provides progress-bar versions of the base-R *apply() family of functions. It supports parallel processing via the cl argument, which accepts a PSOCK cluster object or, when used with futurize, the string "future".

Example: Parallel lapply with progress bar

The pblapply() function works like lapply() but displays a progress bar. For example:

library(pbapply)

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

## Apply a function to each element with a progress bar
xs <- 1:100
ys <- pblapply(xs, slow_fcn)

Here pblapply() evaluates sequentially, but we can easily make it evaluate in parallel by piping to futurize():

library(pbapply)

library(futurize)
plan(multisession) ## parallelize on local machine

xs <- 1:100
ys <- pblapply(xs, slow_fcn) |> futurize()

Comment: The message("x = ", x) output from the slow_fcn() used in the TL;DR example is not relayed to the main R session. This is by design: relaying it would clutter the progress bar that pbapply renders, which is the whole purpose of using pbapply in the first place.

The built-in multisession backend parallelizes on your local computer and works on all operating systems. There are other parallel backends to choose from, including alternatives that parallelize locally and ones that distribute work across remote machines, e.g.

plan(future.mirai::mirai_multisession)

and

plan(future.batchtools::batchtools_slurm)
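
Because the backend is chosen by plan(), the pblapply() call itself does not change when you switch backends. The following is a minimal sketch of that idea, assuming the pbapply, future, and futurize packages are installed. The future package is attached explicitly here only so that plan(), multisession, and sequential are certain to be on the search path; the worker count of 2 and the final plan(sequential) are illustrative choices, not requirements.

```r
library(pbapply)
library(future)    # attached explicitly (assumption: not strictly needed)
library(futurize)

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

xs <- 1:20

## Two local background R sessions (illustrative worker count)
plan(multisession, workers = 2)
ys_par <- pblapply(xs, slow_fcn) |> futurize()

## Same call, now resolved sequentially; switching the plan also
## shuts down the multisession workers started above
plan(sequential)
ys_seq <- pblapply(xs, slow_fcn) |> futurize()

## The results should not depend on the backend
stopifnot(identical(ys_par, ys_seq))
```

Only the plan() lines differ between the two runs, which is what makes it cheap to move from a laptop to a remote backend later.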

Example: Parallel sapply with progress bar

The pbsapply() function simplifies the result like sapply():

library(futurize)
plan(multisession)
library(pbapply)

xs <- 1:100
ys <- pbsapply(xs, slow_fcn) |> futurize()
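
The introduction also mentions pbvapply(), which, like base-R vapply(), takes a FUN.VALUE template and signals an error if a result does not match it. A hedged sketch, assuming pbvapply() is piped to futurize() in the same way as pbsapply() above:

```r
library(futurize)
plan(multisession)
library(pbapply)

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

xs <- 1:20

## FUN.VALUE = numeric(1): each call must return exactly one numeric value
ys <- pbvapply(xs, slow_fcn, FUN.VALUE = numeric(1)) |> futurize()

## Inspect the result; vapply() semantics give a numeric vector, not a list
str(ys)
```

Compared with pbsapply(), the explicit template makes the shape of the result predictable even when xs is empty.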

Supported Functions

The following pbapply functions are supported by futurize():

Without futurize: Manual PSOCK cluster setup

For comparison, here is what it takes to parallelize pblapply() using the parallel package directly, without futurize:

library(pbapply)
library(parallel)

## Set up a PSOCK cluster
ncpus <- 4L
cl <- makeCluster(ncpus)

## Run pblapply in parallel
xs <- 1:100
ys <- pblapply(xs, slow_fcn, cl = cl)

## Tear down the cluster
stopCluster(cl)

This requires you to create the cluster and manage its lifecycle yourself. If you forget to call stopCluster(), or if your code errors out before reaching it, you leak background R processes. You also have to decide upfront how many CPUs and which cluster type to use. Switching to another parallel backend, e.g. a Slurm cluster, would require a completely different setup. With futurize, all of this is handled for you: just pipe to futurize() and control the backend with plan().
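
If you do stay with a manual cluster, the leak of background R processes can be avoided by registering the teardown with on.exit(), which runs even when the function exits with an error. A minimal sketch; the wrapper name run_on_cluster() and its arguments are hypothetical and not part of pbapply or parallel:

```r
library(pbapply)
library(parallel)

## Hypothetical helper: owns the cluster for the duration of one call
run_on_cluster <- function(xs, fcn, ncpus = 2L) {
  cl <- makeCluster(ncpus)
  ## Registered immediately after creation, so the cluster is stopped
  ## on normal return and on error alike
  on.exit(stopCluster(cl), add = TRUE)
  pblapply(xs, fcn, cl = cl)
}

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

ys <- run_on_cluster(1:20, slow_fcn)
```

This still leaves the worker count and the cluster type hard-coded in the calling code, which is the part that plan() takes over in the futurize version.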

Progress Reporting via progressr

An alternative to using pbapply for progress reporting is the progressr package, which is specifically designed to work with the Futureverse ecosystem and to provide near-live progress updates from parallelized computations. See vignette("futurize-11-apply", package = "futurize") for more details.
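
As a rough illustration of what this can look like, the sketch below uses progressr's progressor() and with_progress() around a plain lapply() call. It assumes that base-R lapply() can be piped to futurize(), as the name of the vignette referenced above suggests; treat it as a starting point and consult that vignette for the recommended pattern.

```r
library(futurize)
plan(multisession)
library(progressr)

slow_fcn <- function(x) {
  Sys.sleep(0.1)  # emulate work
  sqrt(x)
}

xs <- 1:20

with_progress({
  ## One progress step per element of xs
  p <- progressor(along = xs)
  ys <- lapply(xs, function(x) {
    p()          # signal that one more element has been processed
    slow_fcn(x)
  }) |> futurize()
})
```

Here the progress updates are signaled from within the function being applied rather than by the apply function itself, so no pbapply call is involved.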