Use mlrMBO to optimize via command line

Many people who want to apply Bayesian optimization need to optimize an algorithm that is not implemented in R but runs on the command line as a shell script or an executable.

We recently published mlrMBO on CRAN. Like any R package it normally operates inside of R, but in this post I want to demonstrate how mlrMBO can be used to optimize an external application. Along the way I will highlight some issues you are likely to run into.

First of all we need a bash script that we want to optimize. This tutorial will only run on Unix systems (Linux, OS X etc.) but should also be informative for Windows users. The following code writes a tiny bash script that uses bc to calculate $\sin(x_1 - 1) + (x_1^2 + x_2^2)$ and writes the result “hidden” in a sentence (The result is 12.34!) to the text file result.txt.

The bash script

# write bash script
lines = '#!/bin/bash
fun ()
{
  x1=$1
  x2=$2
  command="(s($x1-1) + ($x1^2 + $x2^2))"
  result=$(bc -l <<< $command)
}
echo "Start calculation."
fun $1 $2
echo "The result is $result!" > "result.txt"
echo "Finish calculation."
'
writeLines(lines, "fun.sh")
# make it executable:
system("chmod +x fun.sh")
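
As a cross-check for the script, the same formula can be evaluated directly in R. This is only a minimal sketch; the name refFun is illustrative and not part of the original setup.

```r
# Reference implementation of the formula computed by fun.sh:
# sin(x1 - 1) + (x1^2 + x2^2)
# (refFun is an illustrative name, not part of the original post)
refFun = function(x1, x2) {
  sin(x1 - 1) + (x1^2 + x2^2)
}

# Evaluate at a few points; the values can be compared with result.txt
refFun(3, 0)
refFun(0, -2.5)
refFun(0, 0)  # negative, since sin(-1) < 0
```

Note that the function takes negative values in part of the domain, so whatever reads result.txt has to cope with a leading minus sign.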

Running the script from R

Now we need an R function that starts the script, reads the result from the text file and returns it.

library(stringi)
runScript = function(x) {
  command = sprintf("./fun.sh %f %f", x[['x1']], x[['x2']])
  error.code = system(command)
  if (error.code != 0) {
    stop("Simulation had error.code != 0!")
  }
  result = readLines("result.txt")
  # the pattern matches 12 as well as 12.34 and .34, each with an optional
  # leading minus sign (bc prints e.g. -.84 for negative results)
  # the ?: makes the decimals a non-capturing group.
  result = stri_match_first_regex(result, pattern = "-?\\d*(?:\\.\\d+)?(?=\\!)")
  as.numeric(result)
}
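
The extraction step can be tried on sample sentences without running the script. The sketch below uses base R (regexpr() and regmatches() with perl = TRUE) instead of stringi, and a pattern that accepts an optional leading minus sign; the helper name parseResult is hypothetical.

```r
# Hypothetical helper: extract the number that directly precedes "!"
parseResult = function(line) {
  # optional minus, optional integer part, optional decimals, then "!"
  pattern = "-?\\d*(?:\\.\\d+)?(?=!)"
  m = regexpr(pattern, line, perl = TRUE)
  as.numeric(regmatches(line, m))
}

parseResult("The result is 12.34!")
parseResult("The result is -.84!")  # bc omits the leading zero
```

Testing the parser on hand-written lines like these is a cheap way to catch format surprises before an optimization run depends on it.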

This function uses stringi and regular expressions to match the result within the sentence. Depending on the output, different strategies for reading the result make sense. XML files can usually be accessed with XML::xmlParse, XML::getNodeSet, XML::xmlAttrs etc. using XPath queries. Sometimes the good old read.table() is also sufficient. If, for example, the output is written to a file like this:

value1 = 23.45
value2 = 13.82

You can easily use source() like this:

# evaluate the file in a fresh environment so that its
# variables do not end up in the global workspace
EV = new.env()
source(file = "result.txt", local = EV)
res = as.list(EV)
rm(EV)

which will return a list with the entries $value1 and $value2.
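
Since source() executes whatever code the file contains, it can be preferable to read such a file purely as data. The following is a minimal sketch with read.table(), assuming every line has the form name = value; the temporary file only stands in for the real result file.

```r
# Stand-in for the result file written by the external program
tmp = tempfile(fileext = ".txt")
writeLines(c("value1 = 23.45", "value2 = 13.82"), tmp)

# Split each line at "=" and strip the surrounding whitespace
tab = read.table(tmp, sep = "=", strip.white = TRUE,
  col.names = c("name", "value"), stringsAsFactors = FALSE)

# Turn the two columns into a named list
res = setNames(as.list(tab$value), tab$name)
res$value1
unlink(tmp)
```

The result has the same shape as the list obtained via source(), but nothing in the file is ever evaluated as R code.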

Define bounds, wrap function

To evaluate the function from within mlrMBO it has to be wrapped in a smoof function. The smoof function also contains information about the bounds and scales of the domain of the objective function, defined in a ParamSet.

library(mlrMBO)
# Defining the bounds of the parameters:
par.set = makeParamSet(
  makeNumericParam("x1", lower = -3, upper = 3),
  makeNumericParam("x2", lower = -2.5, upper = 2.5)
)
# Wrapping everything in a smoof function:
fn = makeSingleObjectiveFunction(
  id = "fun.sh", 
  fn = runScript,
  par.set = par.set,
  has.simple.signature = FALSE
)

# let's see if the function is working
des = generateGridDesign(par.set, resolution = 3)
des$y = apply(des, 1, fn)
des
##   x1   x2         y
## 1 -3 -2.5 16.006802
## 2  0 -2.5  5.408529
## 3  3 -2.5 16.159297
## 4 -3  0.0  9.756802
## 5  0  0.0 -0.841471
## 6  3  0.0  9.909297
## 7 -3  2.5 16.006802
## 8  0  2.5  5.408529
## 9  3  2.5 16.159297

If you run this locally, you will see that the console output generated by our shell script appears directly in the R console. This can be helpful but also annoying.

Redirecting output

If a lot of output is generated during a single call of system() it might even crash R. To avoid that, I suggest redirecting the output into a file. This way no output is lost and the R console does not get flooded. We can achieve that simply by replacing the command in the function runScript from above with the following code:

  # console output file output_1490030005_1.1_2.4.txt
  output_file = sprintf("output_%i_%.1f_%.1f.txt", as.integer(Sys.time()), x[['x1']], x[['x2']])
  # redirect output with ./fun.sh 1.1 2.4 > output.txt
  # alternative: ./fun.sh 1.1 2.4 > /dev/null to drop it
  command = sprintf("./fun.sh %f %f > %s", x[['x1']], x[['x2']], output_file)
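
Put together, the command construction could look like the following sketch. The helper name buildCommand and its timestamp argument are hypothetical and only serve to make the example reproducible; appending 2>&1 is an addition that also sends error messages to the same file.

```r
# Hypothetical helper that builds the shell command with redirected output
buildCommand = function(x, timestamp = as.integer(Sys.time())) {
  # one output file per evaluation, e.g. output_1490030005_1.1_2.4.txt
  output_file = sprintf("output_%i_%.1f_%.1f.txt", timestamp, x[["x1"]], x[["x2"]])
  # "> file" redirects stdout, "2>&1" sends stderr to the same place
  sprintf("./fun.sh %f %f > %s 2>&1", x[["x1"]], x[["x2"]], output_file)
}

buildCommand(list(x1 = 1.1, x2 = 2.4), timestamp = 1490030005L)
```

Keeping the string construction in a separate function makes it easy to inspect the exact command before handing it to system().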

Start the Optimization

Now everything is set so we can proceed with the usual MBO setup:

ctrl = makeMBOControl()
ctrl = setMBOControlInfill(ctrl, crit = crit.ei)
ctrl = setMBOControlTermination(ctrl, iters = 10)
configureMlr(show.info = FALSE, show.learner.output = FALSE)
run = mbo(fun = fn, control = ctrl)
## Computing y column(s) for design. Not provided.
## [mbo] 0: x1=-1.58; x2=-1.64 : y = 4.65 : 0.0 secs : initdesign
## [mbo] 0: x1=-0.251; x2=0.0593 : y = 0.883 : 0.0 secs : initdesign
## [mbo] 0: x1=1.04; x2=2.05 : y = 5.3 : 0.0 secs : initdesign
## [mbo] 0: x1=-2.39; x2=-0.345 : y = 6.07 : 0.0 secs : initdesign
## [mbo] 0: x1=0.608; x2=-0.742 : y = 0.538 : 0.0 secs : initdesign
## [mbo] 0: x1=2.85; x2=0.9 : y = 9.87 : 0.0 secs : initdesign
## [mbo] 0: x1=2.07; x2=-2.11 : y = 9.58 : 0.0 secs : initdesign
## [mbo] 0: x1=-0.967; x2=1.25 : y = 1.58 : 0.0 secs : initdesign
## [mbo] 1: x1=0.301; x2=-2.08 : y = 3.79 : 0.0 secs : infill_ei
## [mbo] 2: x1=0.375; x2=-0.191 : y = 0.408 : 0.0 secs : infill_ei
## [mbo] 3: x1=0.189; x2=-0.586 : y = 0.346 : 0.0 secs : infill_ei
## [mbo] 4: x1=0.397; x2=-0.516 : y = 0.144 : 0.0 secs : infill_ei
## [mbo] 5: x1=-0.702; x2=-0.45 : y = 0.296 : 0.0 secs : infill_ei
## [mbo] 6: x1=-0.476; x2=-0.536 : y = 0.481 : 0.0 secs : infill_ei
## [mbo] 7: x1=-0.982; x2=-0.258 : y = 0.115 : 0.0 secs : infill_ei
## [mbo] 8: x1=-0.94; x2=0.0274 : y = 0.0493 : 0.0 secs : infill_ei
## [mbo] 9: x1=-1.09; x2=0.238 : y = 0.364 : 0.0 secs : infill_ei
## [mbo] 10: x1=-0.897; x2=-0.102 : y = 0.132 : 0.0 secs : infill_ei
# The resulting optimal configuration:
run$x
## $x1
## [1] -0.9395412
## 
## $x2
## [1] 0.02737539
# The best reached value:
run$y
## [1] 0.04929388

Execute the R script from a shell

You might also not want to be bothered with starting R and running this script manually. I recommend saving everything above as an R script, plus some lines that write the result to a JSON file like this:

library(jsonlite)
write_json(run[c("x","y")], "mbo_res.json")

Let’s assume we saved all of the above as an R script under the name runMBO.R (it is actually available as a gist).

Then you can simply run it from the command line:

Rscript runMBO.R 

As an extra, the script in the gist also contains a simple handler for command line arguments. With it you can define the number of optimization iterations and the maximum allowed time in seconds for the optimization. You can also define the seed to make runs reproducible:

Rscript runMBO.R iters=20 time=10 seed=3
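
The handler from the gist is not reproduced here. One possible way to parse such key=value arguments is sketched below; the function name parseArgs and the default values are assumptions made for illustration.

```r
# Hypothetical parser for arguments of the form key=value
parseArgs = function(args, defaults = list(iters = 10, time = 60, seed = 1)) {
  for (arg in args) {
    parts = strsplit(arg, "=", fixed = TRUE)[[1]]
    # ignore anything that is not a known key=value pair
    if (length(parts) == 2 && parts[1] %in% names(defaults)) {
      defaults[[parts[1]]] = as.numeric(parts[2])
    }
  }
  defaults
}

# In a script run via Rscript the arguments come from commandArgs()
opts = parseArgs(commandArgs(trailingOnly = TRUE))
set.seed(opts$seed)
```

The parsed values could then be passed on to the MBO setup, for example as iters and time.budget in setMBOControlTermination().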

If you want to build a more advanced command line interface you might want to have a look at docopt.

Clean up

To clean up all the files generated by this script you can run:

file.remove("result.txt")
file.remove("fun.sh")
file.remove("mbo_res.json")
output.files = list.files(pattern = "output_\\d+_[0-9_.-]+\\.txt")
file.remove(output.files)
Written on March 22, 2017 by Jakob Richter