Practice: Random Forests basics

ECAS-SFdS 2023 School

Author: Robin Genuer

Published: October 8, 2023

1 Random forests on Ozone data

We consider a dataset on Ozone pollution in Los Angeles in 1976.

Load the dataset, available in the mlbench package, using the following commands:

data(Ozone, package = "mlbench")
# if the package is not yet installed, execute: install.packages('mlbench')
# str() gives the dataset structure
str(Ozone)

Consult the help page to get a full description of the dataset:

help(Ozone, package = "mlbench")

In this dataset, we have:

  • one continuous output variable (V4)

  • 12 input variables, which are either categorical (for example the month of the year, V1) or continuous (for example the wind speed at Los Angeles airport, V6).
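The dataset also contains a fair number of missing values, which will matter when fitting a forest. A quick way to check this (a small extra step, not required by the exercise):

# Number of missing values per variable, and number of incomplete rows
colSums(is.na(Ozone))
sum(!complete.cases(Ozone))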

1.1 A first RF

Load the randomForest package and consult the help page of the randomForest() function:

library("randomForest")
help("randomForest")

Build a random forest predictor, named rf, with default values for all parameters.
Determine its OOB error from the printed output.
Apply the plot() function to the object rf and check whether the number of trees is sufficient by looking at the plot (the error should stabilize as the number of trees grows).

rf <- randomForest(V4 ~ ., data = Ozone)
# We must tell randomForest() how to deal with missing values. For simplicity,
# we can remove all observations with missing values (even though they are
# numerous in this example)
rf <- randomForest(V4 ~ ., data = Ozone, na.action = na.omit)
rf

Call:
 randomForest(formula = V4 ~ ., data = Ozone, na.action = na.omit) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 21.17211
                    % Var explained: 68.28
plot(rf)

# To simplify subsequent commands, we define xOz and yOz for inputs and
# outputs:
OzoneComp <- na.omit(Ozone)
xOz <- OzoneComp[-4]
yOz <- OzoneComp$V4
# The previous plot informs us that a number of trees smaller than 250 is too
# small. Again from the plot, we see that a number of trees between 300 and 500
# is reasonable. We can try to increase the number of trees and see what
# happens (take care of the computational time for this example):

rf1000 <- randomForest(xOz, yOz, ntree = 1000)
plot(rf1000)

# The OOB error is very stable once enough trees are grown, approximately 400
# for this example. Hence, we can keep the default value, which is appropriate
# here, and in case of large execution times we can look for a smaller value.

1.2 Bagging predictor

Now build, still with the randomForest() function, a Bagging predictor, named bag (made of 500 maximal trees).
Compare its OOB error with the previous one.

bag <- randomForest(xOz, yOz, mtry = ncol(xOz))
bag

Call:
 randomForest(x = xOz, y = yOz, mtry = ncol(xOz)) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 12

          Mean of squared residuals: 26.17553
                    % Var explained: 60.78
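To compare the two OOB errors numerically, we can extract the error obtained with the full forest from the mse component of each object (a minimal sketch; the values vary slightly between runs):

# Final OOB mean squared errors of the default RF and of Bagging
c(RF = rf$mse[rf$ntree], Bagging = bag$mse[bag$ntree])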

1.3 Tuning of parameters

Now tune the number of variables selected at each node (we can try all possible values in this example since we only have 12 input variables), while leaving the number of trees at its default value.

nbvars <- 1:ncol(xOz)
oobsMtry <- sapply(nbvars, function(nbv) {
    RF <- randomForest(xOz, yOz, mtry = nbv)
    return(RF$mse[RF$ntree])
})
# We could stabilize the error estimates by building a few forests per mtry
# value, at the price of an increased runtime, e.g. by using:
replicate(n = 10, expr = randomForest(xOz, yOz)$mse[500])
 [1] 21.28603 21.75851 21.49250 21.36064 21.71337 21.43953 21.89062 21.33194
 [9] 21.13052 21.74646
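A sketch of such a stabilized tuning, assuming we average the final OOB error over 5 replicate forests for each mtry value (the runtime grows accordingly):

# Average the final OOB error over 5 forests per candidate mtry value
oobsMtryStab <- sapply(nbvars, function(nbv) {
    mean(replicate(n = 5, expr = {
        RF <- randomForest(xOz, yOz, mtry = nbv)
        RF$mse[RF$ntree]
    }))
})
plot(oobsMtryStab ~ nbvars, type = "b")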

1.4 Permutation-based variable importance

Set an additional parameter of randomForest() to get the variable importance scores and then apply the varImpPlot() function to the object.

rfImp <- randomForest(xOz, yOz, importance = TRUE)
varImpPlot(rfImp)

# To only plot the permutation-based variable importance and avoid the scaling
# of the mean decrease in accuracy by its standard deviation, use:
varImpPlot(rfImp, type = 1, scale = FALSE)
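The raw scores can also be extracted as a matrix with the importance() function, e.g. to sort them; a minimal sketch using the same type and scale choices as above:

# Permutation-based importance (%IncMSE), unscaled, sorted in decreasing order
impOz <- importance(rfImp, type = 1, scale = FALSE)
sort(impOz[, 1], decreasing = TRUE)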

1.5 Other parameters and the object of class randomForest

Build a random forest predictor with the following parameters values:

(replace = FALSE, sampsize = nrow(xOz), mtry = ncol(xOz), ntree = 10, maxnodes = 5)

What are the characteristics of this RF?
Look carefully at the forest component of the resulting object (which is a list) and figure out what its content means.

dumbRF <- randomForest(xOz, yOz, replace = FALSE, sampsize = nrow(xOz), mtry = ncol(xOz),
    ntree = 10, maxnodes = 5)
dumbRF$forest
# The "bootstrap" samples are actually not bootstrap samples here: draws are
# made without replacement and the number of observations drawn equals the
# total number of observations in the dataset, so every tree is built on the
# full original dataset. Furthermore, all variables are tried at each node
# because mtry = p. Finally, the forest contains 10 identical trees, each made
# of 5 leaves.
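To look at one of these trees in a readable form, the getTree() function can be used; a minimal sketch (labelVar = TRUE replaces variable indices by their names):

# Display the structure of the first tree of the forest
getTree(dumbRF, k = 1, labelVar = TRUE)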

2 Random forests on vac18 data

Load the dataset vac18 from the vac18.csv file:

vac18 <- read.csv("vac18.csv", row.names = 1, stringsAsFactors = TRUE)
yVac <- vac18$stimulation
xVac <- vac18[-ncol(vac18)]
dim(xVac)
xVac[1:6, 1:6]
str(yVac)
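Before fitting a forest, it can also be useful to look at how the observations are distributed among the four stimulation classes (a quick extra check):

# Class distribution of the output variable
table(yVac)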

Consult the help page to get a full description of the dataset: vac18 help page on rdocumentation.org.

2.1 Number of trees

Apply the randomForest() function with all parameters default values and check if the number of trees is sufficient for this example.

rfVac <- randomForest(xVac, yVac)
rfVac

Call:
 randomForest(x = xVac, y = yVac) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 31

        OOB estimate of  error rate: 40.48%
Confusion matrix:
      GAG- GAG+ LIPO5 NS class.error
GAG-     2    3     0  5   0.8000000
GAG+     2    4     2  2   0.6000000
LIPO5    0    0    11  0   0.0000000
NS       2    1     0  8   0.2727273
plot(rfVac)

# In classification, we get one solid line for the 'global' OOB error rate, and
# several other lines showing the per-class error rates. Since there is no
# legend on the graph, one can figure out which line corresponds to which class
# by looking, e.g., at the confusion matrix given in the preceding print output.
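If needed, a legend can be added to the previous plot; a sketch assuming the default colour and line-type cycling used by plot() on randomForest objects (which relies on matplot()):

plot(rfVac)
legend("topright", legend = colnames(rfVac$err.rate),
    col = 1:ncol(rfVac$err.rate), lty = 1:ncol(rfVac$err.rate), cex = 0.8)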

2.2 Tuning of mtry

For this dataset, testing all possible values of mtry between 1 and 1000 is computationally demanding. Hence, we can start by testing a few values.

p <- ncol(xVac)
nbvars <- c(sqrt(p)/2, sqrt(p), 2 * sqrt(p), p/10, p/4, p/3, p/2, 2 * p/3, 3 * p/4,
    p)
oobsMtry <- sapply(nbvars, function(nbv) {
    RF <- randomForest(xVac, yVac, mtry = nbv)
    return(RF$err.rate[RF$ntree, "OOB"])
})
cbind(nbvars = floor(nbvars), oobsMtry = oobsMtry)
    nbvars  oobsMtry
OOB     15 0.4285714
OOB     31 0.4285714
OOB     63 0.3571429
OOB    100 0.2619048
OOB    250 0.2619048
OOB    333 0.3095238
OOB    500 0.2619048
OOB    666 0.2619048
OOB    750 0.3095238
OOB   1000 0.2619048
plot(oobsMtry ~ nbvars, type = "b")
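To retrieve, among the tested values, the one reaching the lowest estimated OOB error (keeping in mind that these estimates are noisy and that several values are almost tied):

# mtry value (floored, as in the table above) with the smallest OOB error
floor(nbvars[which.min(oobsMtry)])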

2.3 Permutation-based variable importance

Build a RF, that computes the permutation-based variable importance scores, with 2000 trees and mtry set at 200.
Plot the variable importance scores.
Retrieve the indices and the names of the 15 most important variables.

rfVacImp <- randomForest(xVac, yVac, mtry = 200, ntree = 2000, importance = TRUE)
varImpPlot(rfVacImp, type = 1, scale = FALSE, cex = 0.8)

impSort <- sort(rfVacImp$importance[, "MeanDecreaseAccuracy"], decreasing = TRUE,
    index.return = TRUE)
indVars <- impSort$ix[1:15]
indVars
 [1] 676 285 366 746 906 238 463 104 520 221 147 840 277 554 797
nameVars <- colnames(xVac[, indVars])
nameVars
 [1] "ILMN_1718766" "ILMN_1691156" "ILMN_1775170" "ILMN_1686664" "ILMN_2124802"
 [6] "ILMN_2136089" "ILMN_2102693" "ILMN_1736939" "ILMN_1745271" "ILMN_1695421"
[11] "ILMN_1658396" "ILMN_3252733" "ILMN_2261784" "ILMN_3251511" "ILMN_2188204"
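An equivalent, slightly more compact way to get the same indices and names uses order() directly:

# Indices and names of the 15 most important variables, via order()
indVars <- order(rfVacImp$importance[, "MeanDecreaseAccuracy"], decreasing = TRUE)[1:15]
nameVars <- colnames(xVac)[indVars]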