data(Ozone, package = "mlbench")
# if the package is not yet installed, execute:
# install.packages("mlbench")
# str() gives the dataset structure
str(Ozone)
1 Regression tree on Ozone data
We consider a dataset on Ozone pollution in Los Angeles in 1976.
Load the dataset, available in the mlbench
package, using the following commands:
Consult the help page to get a full description of the dataset:
help(Ozone, package = "mlbench")
In this dataset, we have:
one continuous output variable (
V4
)12 input variables, which are either categorical (for example the month of the year,
V1
) or continuous (for example the wind speed at Los Angeles airport,V6
).
1.1 Intialization
Load the rpart
package (for recursive partitioning), then read the beginning of rpart()
function help:
library("rpart")
help(rpart)
1.2 Tree building.
- Build one regression tree
Tree
with therpart()
function (be careful that the function requires for its first argument a formula as: \(\mathtt{y \sim x}\) and a dataset for its second argument, we do not consider the other parameters for now).
<- rpart(V4~., data = Ozone) Tree
- Print the tree by just executing
Tree
and plot the tree usingplot(t)
followed bytext(t)
(the graphics device must be active in order to thetext()
function to work). If the plot is not satisfying, try to set thexpd
(for “expand”) parameter of thetext()
function toTRUE
.
Treeplot(Tree)
text(Tree, xpd = TRUE)
Remark: in Quarto/Rmarkdown chunk, the entire chunk must be executed at once, and not line by line (otherwise the text()
function cannot run).
- Using the
predict()
function, predict data of the learning set. Thepredict()
function requires the tree object and the name of the dataset to predict (look at thepredict.rpart()
help page).
Take a look at the prediction value of the 3rd observation ofOzone
dataframe and retrieve its path trough the tree. Compute the empirical error ofTree
.
[Take care of the missing values ofV4
]
<- predict(object = Tree, newdata = Ozone)
TreePred 3] TreePred[
3
5.424242
head(Ozone)
# For the 3rd observation:
# V8 = 40 < 67.5
# V10 = 2693 < 3574
# V1 = 1 (January) belongs to "al" (January, December)
# Hence the observation falls in the leaf the second most on the left.
<- mean((Ozone$V4 - TreePred)^2)
empErr empErr
[1] NA
# we get NA because of the missing values in V4
# we can remove missing values inside the mean() function:
<- mean((Ozone$V4 - TreePred)^2, na.rm = TRUE)
empErr empErr
[1] 12.56272
# or remove them by hand:
<- which(is.na(Ozone$V4))
naV4 <- mean((Ozone$V4[-naV4] - TreePred[-naV4])^2)
empErr2 empErr2
[1] 12.56272
Remark: The behavior of the predict()
function is different if you don’t enter the newdata
parameter. Execute predict(object = Tree)
and calculate the length of the obtained predictions vector.
1.3 Building parameters
Determine which tree is built when the following commands are executed (see the rpart.control()
help page for more information on parameters of the tree building):
<- rpart(V4 ~ ., data = Ozone, maxdepth = 1)
Tree1 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0) Tree2
1.4 Empirical errors
Print and plot Tree1
and Tree2
and compute their empirical errors.
Tree1
n=361 (5 observations effacées parce que manquantes)
node), split, n, deviance, yval
* denotes terminal node
1) root 361 22558.000 11.526320
2) V8< 67.5 232 4416.069 7.293103 *
3) V8>=67.5 129 6507.488 19.139530 *
plot(Tree1) ; text(Tree1, xpd = TRUE)
<- predict(Tree1, Ozone)
Tree1Pred <- mean((Ozone$V4 - Tree1Pred)^2, na.rm = TRUE)
empErr1 empErr1
[1] 30.25916
Tree2plot(Tree2)
# text(Tree2)
<- predict(Tree2, Ozone)
Tree2Pred <- mean((Ozone$V4 - Tree2Pred)^2, na.rm = TRUE) empErr2
empErr2
[1] 0
Remark: We perfectly predict learning sample observations here. This illustrates the fact that the maximal tree largely over-fits the data.
1.5 Pruning
Let us start by renaming Tree2
as TreeMax
to be more explicit:
<- Tree2 TreeMax
- From the maximal tree object, print the results of the nested pruned sub-trees (look at
rpart.object
help page).
$cptable TreeMax
- Determine the complexity value corresponding to the minimum of the cross-validation error (indicated in the
xerror
column).
<- which.min(TreeMax$cptable[, "xerror"])
indCPopt indCPopt
7
7
<- TreeMax$cptable[indCPopt, "CP"]
CPopt CPopt
[1] 0.02300046
- Prune the maximal tree using the complexity value just found using the
prune()
function.
<- prune(TreeMax, cp = CPopt)
TreeOpt plot(TreeOpt)
text(TreeOpt, xpd = TRUE)
2 Diving into CART details
Let us continue further on Ozone data.
2.1 yval
- Let start again with
Tree1
the CART tree with only two leaves. Print the resulting treeTree1
.
Tree1
n=361 (5 observations effacées parce que manquantes)
node), split, n, deviance, yval
* denotes terminal node
1) root 361 22558.000 11.526320
2) V8< 67.5 232 4416.069 7.293103 *
3) V8>=67.5 129 6507.488 19.139530 *
- For each node, the last information given in the print,
yval
, is the value of \(Y\) associated to the current node. Check that those values ofyval
are approximately what you expected (with a precision of \(10^{-2}\)).
[Be aware of missing values]
tapply(Ozone$V4, Ozone$V8 < 67.5, mean, na.rm = TRUE)
FALSE TRUE
19.139535 7.286957
- Determine why those values are not exactly the same at a precision of \(10^{-6}\).
# There are some missing values for V8, but the tree assigns them anyway
# to leaves, as it can be seen in where component of object Tree1:
tapply(na.omit(Ozone$V4), Tree1$where == 2, mean, na.rm = TRUE)
FALSE TRUE
19.139535 7.293103
2.2 Competing splits
- Apply the
summary()
function toTree1
.
summary(Tree1)
Call:
rpart(formula = V4 ~ ., data = Ozone, maxdepth = 1)
n=361 (5 observations effacées parce que manquantes)
CP nsplit rel error xerror xstd
1 0.5157568 0 1.0000000 1.007908 0.07477639
2 0.0100000 1 0.4842432 0.529678 0.04419110
Variable importance
V8 V12 V5 V1 V7 V10
38 21 18 18 3 3
Node number 1: 361 observations, complexity param=0.5157568
mean=11.52632, MSE=62.48753
left son=2 (232 obs) right son=3 (129 obs)
Primary splits:
V8 < 67.5 to the left, improve=0.5146929, (2 missing)
V12 < 63.59 to the left, improve=0.4221966, (14 missing)
V10 < 3472.5 to the right, improve=0.3188287, (15 missing)
V9 < 62.87 to the left, improve=0.3174874, (137 missing)
V1 splits as LLLLRRRRRRLL, improve=0.3079652, (0 missing)
Surrogate splits:
V12 < 67.19 to the left, agree=0.836, adj=0.543, (2 split)
V5 < 5795 to the left, agree=0.813, adj=0.481, (0 split)
V1 splits as LLLLRRRRRLLL, agree=0.808, adj=0.465, (0 split)
V7 < 70.5 to the left, agree=0.669, adj=0.078, (0 split)
V10 < 2396 to the right, agree=0.666, adj=0.070, (0 split)
Node number 2: 232 observations
mean=7.293103, MSE=19.03478
Node number 3: 129 observations
mean=19.13953, MSE=50.44565
- In the result, the “improve” quantity correspond to the proportion of gain in terms of homogeneity brought by the split. Hence, the more the “improve” value is, the more the decrease in in-node variance of child nodes is.
- Explain what is the split “V12 < 63.59” and why it is printed at this place.
# The split "V12 < 63.59" is the second in decreasing order of gain of
# homogeneity, i.e. this is the split that would have been chosen if
# V8 was not available in the data.
- Compute variances of the child nodes effectively built. Then compute variances of child nodes that would have been built if the split “V12 < 63.59” was used.
tapply(Ozone$V4, Ozone$V8 < 67.5, var, na.rm = TRUE)
FALSE TRUE
50.83975 19.06140
tapply(Ozone$V4, Ozone$V12 < 63.59, var, na.rm = TRUE)
FALSE TRUE
61.14879 14.67107
# We see that the right child node variance is really smaller for the
# optimal split, whereas the left child node one is larger.
# To check that the homogeneity gain is effectively larger with the optimal
# split, a weighted mean of the two variances must be computed.
<- Ozone[!is.na(Ozone$V4),]
OzV4NA sum((table(OzV4NA$V8 < 67.5) * tapply(OzV4NA$V4, OzV4NA$V8 < 67.5, var, na.rm = TRUE)) / sum(table(OzV4NA$V8 < 67.5)))
[1] 30.48036
sum((table(OzV4NA$V12 < 63.59) * tapply(OzV4NA$V4, OzV4NA$V12 < 63.59, var, na.rm = TRUE)) / sum(table(OzV4NA$V12 < 63.59)))
[1] 35.96778
2.3 Surrogate splits
Still in the display obtained with summary(Tree1)
, it is question of surrogate splits. These splits are classified in descending order of the “agree” quantity (for “agreement”), representing the degree of agreement with the optimal split: the higher the value of a split, the more it sends about the same observations in the left child nodes and right as the optimal split.
- Construct the cross-table of observations from the Ozone data sent to the left or right by the optimal split and by the first surrogate split.
[Remove missing data from theV4
variable beforehand]
<- Ozone[!is.na(Ozone$V4),]
OzV4NA <- table(OzV4NA$V8 < 67.5, OzV4NA$V12 < 67.19)
agreement agreement
FALSE TRUE
FALSE 102 20
TRUE 25 198
- Compute the ratio between the number of agreements and 359.
sum(diag(agreement)) / 359
[1] 0.8356546
- Determine why it was necessary to divide by 359 in the previous question.
# We compute the ratio between the number of observations actually sent in
# the same child nodes and the number of non-missing observations of the
# variable associated with the optimal split, here V8.
2.4 Prediction when values are missing
At each node, surrogate splits are therefore calculated. These surrogate splits are used to predict an observation which has a missing value (if the value associated with the split variable is missing, the surrogate split with the most “agreement” value is used ; if it is also missing, the next surrogate split is used, etc).
- For the two-leaves tree, determine in which leaf is located the observation on line 1 of Ozone data. Explain the steps performed.
head(Ozone)
# The variable V8 is missing for this observation. So we're going to look
# at the value of V12, and compare it to 67.19, because it is the 1st
# surrogate split. Since 30.56 is less than 67.19, we know that the
# observation will therefore go in the left child node. We can check this
# with:
$where[1] Tree1
1
2
# "2" stands for the left child node of the root node.
- Build the tree of depth 2 (with 4 leaves), then determine in which leaf falls the observation in line 2 of Ozone data, by explaining the path taken to get to this leaf.
<- rpart(V4 ~ ., data = Ozone, maxdepth = 2)
Tree4l Tree4l
n=361 (5 observations effacées parce que manquantes)
node), split, n, deviance, yval
* denotes terminal node
1) root 361 22558.0000 11.526320
2) V8< 67.5 232 4416.0690 7.293103
4) V10>=3573.5 115 721.4435 5.069565 *
5) V10< 3573.5 117 2567.1970 9.478632 *
3) V8>=67.5 129 6507.4880 19.139530
6) V8< 79.5 85 3320.8940 16.564710 *
7) V8>=79.5 44 1534.4320 24.113640 *
summary(Tree4l)
Call:
rpart(formula = V4 ~ ., data = Ozone, maxdepth = 2)
n=361 (5 observations effacées parce que manquantes)
CP nsplit rel error xerror xstd
1 0.51575683 0 1.0000000 1.0037258 0.07423753
2 0.07324064 1 0.4842432 0.5503434 0.04538689
3 0.04997912 2 0.4110025 0.4824789 0.04207148
4 0.01000000 3 0.3610234 0.4330381 0.03966768
Variable importance
V8 V12 V5 V1 V10 V7 V2 V13
36 21 17 15 5 2 2 1
Node number 1: 361 observations, complexity param=0.5157568
mean=11.52632, MSE=62.48753
left son=2 (232 obs) right son=3 (129 obs)
Primary splits:
V8 < 67.5 to the left, improve=0.5146929, (2 missing)
V12 < 63.59 to the left, improve=0.4221966, (14 missing)
V10 < 3472.5 to the right, improve=0.3188287, (15 missing)
V9 < 62.87 to the left, improve=0.3174874, (137 missing)
V1 splits as LLLLRRRRRRLL, improve=0.3079652, (0 missing)
Surrogate splits:
V12 < 67.19 to the left, agree=0.836, adj=0.543, (2 split)
V5 < 5795 to the left, agree=0.813, adj=0.481, (0 split)
V1 splits as LLLLRRRRRLLL, agree=0.808, adj=0.465, (0 split)
V7 < 70.5 to the left, agree=0.669, adj=0.078, (0 split)
V10 < 2396 to the right, agree=0.666, adj=0.070, (0 split)
Node number 2: 232 observations, complexity param=0.04997912
mean=7.293103, MSE=19.03478
left son=4 (115 obs) right son=5 (117 obs)
Primary splits:
V10 < 3573.5 to the right, improve=0.2581106, (8 missing)
V12 < 53.15 to the left, improve=0.2442090, (7 missing)
V8 < 58.5 to the left, improve=0.2128880, (2 missing)
V1 splits as LRRRRR-RLRRL, improve=0.1218698, (0 missing)
V9 < 56.21 to the left, improve=0.1146961, (80 missing)
Surrogate splits:
V12 < 52.79 to the left, agree=0.875, adj=0.745, (1 split)
V5 < 5675 to the left, agree=0.683, adj=0.355, (7 split)
V2 splits as LLLRLLRLLRLLRLLLLRRRLLLRRLLLRRR, agree=0.643, adj=0.273, (0 split)
V8 < 56.5 to the left, agree=0.638, adj=0.264, (0 split)
V13 < 65 to the right, agree=0.634, adj=0.255, (0 split)
Node number 3: 129 observations, complexity param=0.07324064
mean=19.13953, MSE=50.44565
left son=6 (85 obs) right son=7 (44 obs)
Primary splits:
V8 < 79.5 to the left, improve=0.2538863, (0 missing)
V12 < 72.77 to the left, improve=0.2307975, (7 missing)
V9 < 70.97 to the left, improve=0.1900417, (57 missing)
V13 < 145 to the right, improve=0.1851993, (0 missing)
V11 < -25.5 to the left, improve=0.1775629, (0 missing)
Surrogate splits:
V12 < 75.65 to the left, agree=0.798, adj=0.409, (0 split)
V2 splits as LLLLLLLLLLLLLLRLLLRLLLLRRRRRRLL, agree=0.775, adj=0.341, (0 split)
V5 < 5865 to the left, agree=0.775, adj=0.341, (0 split)
V1 splits as L-LLLLRLLLL-, agree=0.682, adj=0.068, (0 split)
V3 splits as LLRLLLL, agree=0.674, adj=0.045, (0 split)
Node number 4: 115 observations
mean=5.069565, MSE=6.273422
Node number 5: 117 observations
mean=9.478632, MSE=21.94185
Node number 6: 85 observations
mean=16.56471, MSE=39.06934
Node number 7: 44 observations
mean=24.11364, MSE=34.87345
head(Ozone)
# At the first node (the root), we have a value for V8 (38) which makes the
# observation go forward in the left child node (node number 2).
# Arrival at node 2, as the value of V10 is missing, we must look at
# surrogate splits. The value of V12 is also missing, however
# V5's is not. We have 5660 < 5675, so the observation will finally go
# in the left child node of node 2, i.e. node 4.
$where[2] Tree4l
2
3
# Here the result is 3, because it is the number of the table line:
$frame Tree4l
# In line 3, we have node 4 (which is a leaf, while node 3
# corresponds to the right child node of the root node and is on line 5).
2.5 Variable importance
Surrogate splits are also used to compute variable importance scores. For a given variable \(X^j\), we sum all the gains of homogeneity for nodes where \(X^j\) is actually the optimal split variable, and for nodes where this is not the case, we sums the homogeneity gains associated with surrogate splits using \(X^j\).
- Build the maximum tree, then display the importance of variables.
<- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
arbre_max <- arbre_max$variable.importance
imp_cart imp_cart
V8 V12 V1 V5 V2 V10 V11
14555.1689 8948.3863 8010.5991 7723.9996 4884.6418 2529.3148 1875.4416
V7 V3 V13 V6 V9
1590.6231 1159.2380 1122.9561 604.1321 154.4161
- Propose a graphical representation of theses variables importance scores.
barplot(imp_cart)
- Repeat the two previous questions by removing missing data at the time of maximal tree building. Comment by explaining what happens with the variable
V9
.
<- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0, na.action = na.omit)
arbre_max_naomit <- arbre_max_naomit$variable.importance
imp_cart_naomit barplot(imp_cart_naomit)
# V9 is now in 1st position for the CART tree. Recall that this variable
# contains a large number of missing data.
# In conclusion, the importance of the variables calculated by CART gives
# results consistent with the application here. In any case, it's much more
# satisfying than thinking that the most important variables are "only" those
# that appear in the splits of the optimal tree.
# The only disadvantage of this importance is that, as we see here for V2,
# it tends to bias the importance of qualitative variables with a large
# number of levels.