1 Random forests on Ozone data
We consider a dataset on ozone pollution in Los Angeles in 1976.
Load the dataset, available in the mlbench package, using the following commands:
data(Ozone, package = "mlbench")
# if the package is not yet installed, execute: install.packages('mlbench')
# str() gives the dataset structure
str(Ozone)
Consult the help page to get a full description of the dataset:
help(Ozone, package = "mlbench")
In this dataset, we have:
- one continuous output variable (V4),
- 12 input variables, which are either categorical (for example the month of the year, V1) or continuous (for example the wind speed at Los Angeles airport, V6).
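Before modelling, it can be useful to confirm the variable types and the amount of missing data; a quick, optional check along these lines:
# Classes of the 13 columns (factor vs numeric) and number of NAs per column
sapply(Ozone, class)
colSums(is.na(Ozone))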
1.1 A first RF
Load the randomForest package and consult the help page of the randomForest() function:
library("randomForest")
help("randomForest")
Build a random forest predictor, named rf, with default values for all parameters.
Determine its OOB error from the printed output.
Apply the plot() function to the object rf and check whether the number of trees is sufficient by looking at the plot (we should see a stabilization of the error as the number of trees increases).
# A first attempt fails because the dataset contains missing values (the
# default na.action of randomForest() is na.fail):
rf <- randomForest(V4 ~ ., data = Ozone)
# We must tell randomForest() how to deal with missing values. We can, for
# simplicity, remove all missing values from the dataset (even if they are
# numerous for this example):
rf <- randomForest(V4 ~ ., data = Ozone, na.action = na.omit)
rf
Call:
randomForest(formula = V4 ~ ., data = Ozone, na.action = na.omit)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 4
Mean of squared residuals: 21.17211
% Var explained: 68.28
plot(rf)
# To simplify subsequent commands, we define xOz and yOz for inputs and
# outputs:
OzoneComp <- na.omit(Ozone)
xOz <- OzoneComp[-4]
yOz <- OzoneComp$V4
# The previous plot informs us that a number of trees smaller than 250 is too
# small. Again from the plot, we see that a number of trees between 300 and 500
# is reasonable. We can try to increase the number of trees and see what
# happens (take care of the computational time for this example):
rf1000 <- randomForest(xOz, yOz, ntree = 1000)
plot(rf1000)
# The OOB error is very stable when enough trees are grown, approximately 400
# for this example. Here, we can keep the default value which is appropriate,
# and in case of large execution times we can look for a smaller value.
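If execution time really matters, the stored OOB errors of rf1000 can also be inspected directly to pick a smaller but still safe number of trees; a minimal sketch, where the 1% tolerance is an arbitrary choice:
# rf1000$mse contains the OOB MSE obtained after 1, 2, ..., 1000 trees
finalErr <- rf1000$mse[rf1000$ntree]
# first number of trees whose OOB error is already within 1% of the final one
which(rf1000$mse <= 1.01 * finalErr)[1]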
1.2 Bagging predictor
Now build, still with the randomForest() function, a Bagging predictor, named bag (made of 500 maximal trees).
Compare its OOB error with the previous one.
bag <- randomForest(xOz, yOz, mtry = ncol(xOz))
bag
Call:
randomForest(x = xOz, y = yOz, mtry = ncol(xOz))
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 12
Mean of squared residuals: 26.17553
% Var explained: 60.78
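To make the requested comparison explicit, the two OOB errors can be put side by side (a small sketch; rf is the forest fitted above with na.action = na.omit):
# OOB mean squared error of the default RF versus the Bagging predictor
c(RF = rf$mse[rf$ntree], Bagging = bag$mse[bag$ntree])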
1.3 Tuning of parameters
Now tune the number of variables selected at each node (we can try all possible values in this example since we only have 12 input variables), while leaving the number of trees at its default value.
nbvars <- 1:ncol(xOz)
oobsMtry <- sapply(nbvars, function(nbv) {
  RF <- randomForest(xOz, yOz, mtry = nbv)
  return(RF$mse[RF$ntree])
})
# We could stabilize the error estimates by building a few forests per mtry
# value, at the price of an increased runtime, e.g. by using:
replicate(n = 10, expr = randomForest(xOz, yOz)$mse[500])
[1] 21.28603 21.75851 21.49250 21.36064 21.71337 21.43953 21.89062 21.33194
[9] 21.13052 21.74646
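Following that idea, a possible (slower) sketch averages a few forests per mtry value so that the resulting error curve is less noisy; the name oobsMtryAvg and the choice of 5 replicates are only illustrative:
# Average the final OOB MSE over 5 forests for each candidate mtry value
oobsMtryAvg <- sapply(nbvars, function(nbv) {
  mean(replicate(5, randomForest(xOz, yOz, mtry = nbv)$mse[500]))
})
plot(oobsMtryAvg ~ nbvars, type = "b")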
1.4 Permutation-based variable importance
Set an additional parameter of randomForest() to get the variable importance scores, and then apply the varImpPlot() function to the object.
rfImp <- randomForest(xOz, yOz, importance = TRUE)
varImpPlot(rfImp)
# To only plot the permutation-based variable importance and avoid the scaling
# of the mean decrease in accuracy by its standard deviation, use:
varImpPlot(rfImp, type = 1, scale = FALSE)
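The same unscaled permutation importances can also be retrieved as a numeric matrix with the importance() accessor, for example to sort them; the object name imp is only illustrative:
# Unscaled permutation importance (%IncMSE), sorted in decreasing order
imp <- importance(rfImp, type = 1, scale = FALSE)
imp[order(imp, decreasing = TRUE), , drop = FALSE]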
1.5 Other parameters and the object of class randomForest
Build a random forest predictor with the following parameter values:
(replace = FALSE, sampsize = nrow(xOz), mtry = ncol(xOz), ntree = 10, maxnodes = 5)
What are the characteristics of this RF?
Look carefully at the forest component of the resulting object (which is a list) and figure out what its content means.
dumbRF <- randomForest(xOz, yOz, replace = FALSE, sampsize = nrow(xOz),
  mtry = ncol(xOz), ntree = 10, maxnodes = 5)
dumbRF$forest
# The "bootstrap" samples here are actually not bootstrap samples: draws are made without
# replacement, and the number of observations drawn is fixed to the total
# number of observations in the dataset. So, in this case, all trees are built
# on the full original dataset. Furthermore, all variables are chosen at each
# node because mtry = p. Finally, the forest contains 10 identical trees made
# of 5 leaves.
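To see this concretely, individual trees can be extracted with getTree(); a small illustrative check (tree 1 and tree 10 are expected to coincide here):
# The first tree, with variable names instead of column indices
getTree(dumbRF, k = 1, labelVar = TRUE)
# All trees are grown on the same sample with mtry = p, hence should be equal
identical(getTree(dumbRF, k = 1), getTree(dumbRF, k = 10))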
2 Random forests on vac18 data
Load the dataset vac18 from the vac18.csv file:
vac18 <- read.csv("vac18.csv", row.names = 1, stringsAsFactors = TRUE)
yVac <- vac18$stimulation
xVac <- vac18[-ncol(vac18)]
dim(xVac)
xVac[1:6, 1:6]
str(yVac)
Consult the help page to get a full description of the dataset: vac18 help page on rdocumentation.org.
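Since the OOB error will later be reported per class, a quick look at the class sizes of the response can help when reading the results:
# Number of samples in each stimulation class
table(yVac)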
2.1 Number of trees
Apply the randomForest() function with all parameters at their default values and check whether the number of trees is sufficient for this example.
rfVac <- randomForest(xVac, yVac)
rfVac
Call:
randomForest(x = xVac, y = yVac)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 31
OOB estimate of error rate: 40.48%
Confusion matrix:
GAG- GAG+ LIPO5 NS class.error
GAG- 2 3 0 5 0.8000000
GAG+ 2 4 2 2 0.6000000
LIPO5 0 0 11 0 0.0000000
NS 2 1 0 8 0.2727273
plot(rfVac)
# In classification, we get one plain line for the 'global' OOB error rate, and
# several other lines which show the per-class error rates. Since there is no
# legend on the graph, one can figure out which line corresponds to each class
# by looking, e.g., at the confusion matrix given in the preceding print result.
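One way to identify the lines directly on the plot is to add a legend built from the column names of the err.rate component; a sketch, assuming the default colours and line types used by plot.randomForest (which relies on matplot):
plot(rfVac)
legend("topright", legend = colnames(rfVac$err.rate),
       col = 1:ncol(rfVac$err.rate), lty = 1:ncol(rfVac$err.rate), cex = 0.8)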
2.2 Tuning of mtry
For this dataset, testing all possible values of mtry
between 1 and 1000 is quite demanding. Hence, we can start by testing a few values.
p <- ncol(xVac)
nbvars <- c(sqrt(p)/2, sqrt(p), 2 * sqrt(p), p/10, p/4, p/3, p/2, 2 * p/3, 3 * p/4, p)
oobsMtry <- sapply(nbvars, function(nbv) {
  RF <- randomForest(xVac, yVac, mtry = nbv)
  return(RF$err.rate[RF$ntree, "OOB"])
})
cbind(nbvars = floor(nbvars), oobsMtry = oobsMtry)
nbvars oobsMtry
OOB 15 0.4285714
OOB 31 0.4285714
OOB 63 0.3571429
OOB 100 0.2619048
OOB 250 0.2619048
OOB 333 0.3095238
OOB 500 0.2619048
OOB 666 0.2619048
OOB 750 0.3095238
OOB 1000 0.2619048
plot(oobsMtry ~ nbvars, type = "b")
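As a complement to this manual grid, the randomForest package also provides tuneRF(), which searches around a starting mtry value by multiplicative steps; a hedged sketch (the step factor and improvement threshold are arbitrary choices):
# Start at sqrt(p) and multiply/divide mtry by 2 while the OOB error improves
# by at least 1%
tuneRF(xVac, yVac, mtryStart = floor(sqrt(ncol(xVac))), ntreeTry = 500,
       stepFactor = 2, improve = 0.01, trace = FALSE, plot = TRUE)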
2.3 Permutation-based variable importance
Build a RF that computes the permutation-based variable importance scores, with 2000 trees and mtry set to 200.
Plot the variable importance scores.
Retrieve the indices and the names of the 15 most important variables.
rfVacImp <- randomForest(xVac, yVac, mtry = 200, ntree = 2000, importance = TRUE)
varImpPlot(rfVacImp, type = 1, scale = FALSE, cex = 0.8)
impSort <- sort(rfVacImp$importance[, "MeanDecreaseAccuracy"], decreasing = TRUE,
  index.return = TRUE)
indVars <- impSort$ix[1:15]
indVars
[1] 676 285 366 746 906 238 463 104 520 221 147 840 277 554 797
nameVars <- colnames(xVac[, indVars])
nameVars
[1] "ILMN_1718766" "ILMN_1691156" "ILMN_1775170" "ILMN_1686664" "ILMN_2124802"
[6] "ILMN_2136089" "ILMN_2102693" "ILMN_1736939" "ILMN_1745271" "ILMN_1695421"
[11] "ILMN_1658396" "ILMN_3252733" "ILMN_2261784" "ILMN_3251511" "ILMN_2188204"