Practice: CART trees in regression

ECAS-SFdS 2023 School

Author

Robin Genuer

Published

September 27, 2023

1 Regression tree on Ozone data

We consider a dataset on Ozone pollution in Los Angeles in 1976.

Load the dataset, available in the mlbench package, using the following commands:

data(Ozone, package = "mlbench")
# if the package is not yet installed, execute:
# install.packages("mlbench")
# str() gives the dataset structure
str(Ozone)

Consult the help page to get a full description of the dataset:

help(Ozone, package = "mlbench")

In this dataset, we have:

  • one continuous output variable (V4)

  • 12 input variables, which are either categorical (for example the month of the year, V1) or continuous (for example the wind speed at Los Angeles airport, V6).

1.1 Initialization

Load the rpart package (for recursive partitioning), then read the beginning of the rpart() function help:

library("rpart")
help(rpart)

1.2 Tree building

  1. Build a regression tree Tree with the rpart() function (be careful: the function requires a formula of the form \(\mathtt{y \sim x}\) as its first argument and a dataset as its second argument; we do not consider the other parameters for now).
Tree <- rpart(V4~., data = Ozone)
  2. Print the tree by simply executing Tree, and plot it using plot(Tree) followed by text(Tree) (the graphics device must be active for the text() function to work). If the plot is not satisfying, try setting the xpd (for “expand”) parameter of the text() function to TRUE.
Tree
plot(Tree)
text(Tree, xpd = TRUE)

Remark: in a Quarto/R Markdown chunk, the entire chunk must be executed at once, not line by line (otherwise the text() function cannot run).
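As an aside (not required here), the rpart.plot package, if it is installed, usually produces a more readable display than the base plot()/text() pair:

# install.packages("rpart.plot")  # if not yet installed
library(rpart.plot)
rpart.plot(Tree)  # draws the tree with labeled nodes and split rules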

  3. Using the predict() function, predict the observations of the learning set. The predict() function requires the tree object and the dataset to predict (look at the predict.rpart() help page).
    Take a look at the prediction value of the 3rd observation of the Ozone dataframe and retrieve its path through the tree. Compute the empirical error of Tree.
    [Take care of the missing values of V4]
TreePred <- predict(object = Tree, newdata = Ozone)
TreePred[3]
       3 
5.424242 
head(Ozone)
# For the 3rd observation:
# V8 = 40 < 67.5
# V10 = 2693 < 3574
# V1 = 1 (January) belongs to "al", i.e. levels a (January) and l (December)
# Hence the observation falls in the second leaf from the left.
empErr <- mean((Ozone$V4 - TreePred)^2)
empErr
[1] NA
# we get NA because of the missing values in V4
# we can remove missing values inside the mean() function:
empErr <- mean((Ozone$V4 - TreePred)^2, na.rm = TRUE)
empErr
[1] 12.56272
# or remove them by hand:
naV4 <- which(is.na(Ozone$V4))
empErr2 <- mean((Ozone$V4[-naV4] - TreePred[-naV4])^2)
empErr2
[1] 12.56272

Remark: the behavior of the predict() function is different if you do not supply the newdata argument. Execute predict(object = Tree) and compute the length of the obtained predictions vector.
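A minimal sketch of this check (the counts follow from the n=361 printed for the tree, since 5 values of V4 are missing among the 366 rows of Ozone):

TreePredLearn <- predict(object = Tree)
# without newdata, predictions are returned only for the observations
# actually used to fit the tree (those with a non-missing V4)
length(TreePredLearn)
# with newdata = Ozone, we obtained one prediction per row instead:
length(TreePred)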

1.3 Building parameters

Determine which tree is built when the following commands are executed (see the rpart.control() help page for more information on the tree-building parameters); a quick check is sketched after the code:

Tree1 <- rpart(V4 ~ ., data = Ozone, maxdepth = 1)
Tree2 <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
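A minimal way to see what these parameters produce, before printing the trees in the next subsection, is to count their leaves (the "<leaf>" rows of the frame component of an rpart object):

# Tree1 (maxdepth = 1) should be a stump with a single split, hence 2 leaves
sum(Tree1$frame$var == "<leaf>")
# Tree2 (minsplit = 2, cp = 0) is the maximal tree, with many more leaves
sum(Tree2$frame$var == "<leaf>")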

1.4 Empirical errors

Print and plot Tree1 and Tree2 and compute their empirical errors.

Tree1
n=361 (5 observations deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

1) root 361 22558.000 11.526320  
  2) V8< 67.5 232  4416.069  7.293103 *
  3) V8>=67.5 129  6507.488 19.139530 *
plot(Tree1) ; text(Tree1, xpd = TRUE)

Tree1Pred <- predict(Tree1, Ozone)
empErr1 <- mean((Ozone$V4 - Tree1Pred)^2, na.rm = TRUE)
empErr1
[1] 30.25916
Tree2
plot(Tree2)

# text(Tree2) # the labels would be unreadable on the maximal tree
Tree2Pred <- predict(Tree2, Ozone)
empErr2 <- mean((Ozone$V4 - Tree2Pred)^2, na.rm = TRUE)
empErr2
[1] 0

Remark: we perfectly predict the learning sample observations here. This illustrates the fact that the maximal tree largely overfits the data.
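To see this over-fitting in numbers (a sketch anticipating the pruning section below), look at the last rows of the cptable component of Tree2: the relative empirical error ("rel error" column) goes to 0 as the tree grows, whereas the cross-validated error ("xerror" column) does not.

# last rows of the complexity table of the maximal tree
tail(Tree2$cptable[, c("nsplit", "rel error", "xerror")])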

1.5 Pruning

Let us start by renaming Tree2 as TreeMax to be more explicit:

TreeMax <- Tree2
  1. From the maximal tree object, print the table of the nested pruned sub-trees (look at the rpart.object help page).
TreeMax$cptable
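Equivalently, the rpart helpers printcp() and plotcp() display this table and plot the cross-validated error against the complexity parameter:

printcp(TreeMax)  # prints the cptable together with the call
plotcp(TreeMax)   # plots xerror (with error bars) against cp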
  2. Determine the complexity value corresponding to the minimum of the cross-validation error (given in the xerror column).
indCPopt <- which.min(TreeMax$cptable[, "xerror"])
indCPopt
7 
7 
CPopt <- TreeMax$cptable[indCPopt, "CP"]
CPopt
[1] 0.02300046
  3. Prune the maximal tree with the prune() function, using the complexity value just found.
TreeOpt <- prune(TreeMax, cp = CPopt)
plot(TreeOpt)
text(TreeOpt, xpd = TRUE)
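As a quick sanity check (not asked in the original), the empirical error of the pruned tree can be computed as before; we expect a value between that of the maximal tree (0) and that of the stump Tree1 (about 30.26):

TreeOptPred <- predict(TreeOpt, Ozone)
mean((Ozone$V4 - TreeOptPred)^2, na.rm = TRUE)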

2 Diving into CART details

Let us continue further on Ozone data.

2.1 yval

  1. Let us start again with Tree1, the CART tree with only two leaves. Print the resulting tree Tree1.
Tree1
n=361 (5 observations deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

1) root 361 22558.000 11.526320  
  2) V8< 67.5 232  4416.069  7.293103 *
  3) V8>=67.5 129  6507.488 19.139530 *
  2. For each node, the last value given in the print, yval, is the value of \(Y\) associated with the current node. Check that these values of yval are approximately what you expected (with a precision of \(10^{-2}\)).
    [Be aware of missing values]
tapply(Ozone$V4, Ozone$V8 < 67.5, mean, na.rm = TRUE)
    FALSE      TRUE 
19.139535  7.286957 
  3. Determine why those values are not exactly the same at a precision of \(10^{-6}\).
# There are some missing values for V8, but the tree assigns them anyway
# to leaves, as can be seen in the where component of the Tree1 object:
tapply(na.omit(Ozone$V4), Tree1$where == 2, mean, na.rm = TRUE)
    FALSE      TRUE 
19.139535  7.293103 

2.2 Competing splits

  1. Apply the summary() function to Tree1.
summary(Tree1)
Call:
rpart(formula = V4 ~ ., data = Ozone, maxdepth = 1)
  n=361 (5 observations deleted due to missingness)

         CP nsplit rel error   xerror       xstd
1 0.5157568      0 1.0000000 1.007908 0.07477639
2 0.0100000      1 0.4842432 0.529678 0.04419110

Variable importance
 V8 V12  V5  V1  V7 V10 
 38  21  18  18   3   3 

Node number 1: 361 observations,    complexity param=0.5157568
  mean=11.52632, MSE=62.48753 
  left son=2 (232 obs) right son=3 (129 obs)
  Primary splits:
      V8  < 67.5   to the left,    improve=0.5146929, (2 missing)
      V12 < 63.59  to the left,    improve=0.4221966, (14 missing)
      V10 < 3472.5 to the right,   improve=0.3188287, (15 missing)
      V9  < 62.87  to the left,    improve=0.3174874, (137 missing)
      V1  splits as  LLLLRRRRRRLL, improve=0.3079652, (0 missing)
  Surrogate splits:
      V12 < 67.19  to the left,    agree=0.836, adj=0.543, (2 split)
      V5  < 5795   to the left,    agree=0.813, adj=0.481, (0 split)
      V1  splits as  LLLLRRRRRLLL, agree=0.808, adj=0.465, (0 split)
      V7  < 70.5   to the left,    agree=0.669, adj=0.078, (0 split)
      V10 < 2396   to the right,   agree=0.666, adj=0.070, (0 split)

Node number 2: 232 observations
  mean=7.293103, MSE=19.03478 

Node number 3: 129 observations
  mean=19.13953, MSE=50.44565 
  2. In the result, the “improve” quantity corresponds to the proportion of homogeneity gain brought by the split. Hence, the larger the “improve” value, the larger the decrease in the within-node variance of the child nodes (a numerical check from the printed deviances is sketched at the end of this subsection).
  3. Explain what the split “V12 < 63.59” is and why it is printed at this place.
# The split "V12 < 63.59" is the second best in decreasing order of
# homogeneity gain, i.e. it is the split that would have been chosen if
# V8 had not been available in the data.
  4. Compute the variances of the child nodes actually built. Then compute the variances of the child nodes that would have been built if the split “V12 < 63.59” had been used.
tapply(Ozone$V4, Ozone$V8 < 67.5, var, na.rm = TRUE)
   FALSE     TRUE 
50.83975 19.06140 
tapply(Ozone$V4, Ozone$V12 < 63.59, var, na.rm = TRUE)
   FALSE     TRUE 
61.14879 14.67107 
# We see that the right child node variance is clearly smaller for the
# optimal split, whereas the left child node variance is larger.
# To check that the homogeneity gain is indeed larger with the optimal
# split, a weighted mean of the two variances must be computed.
OzV4NA <- Ozone[!is.na(Ozone$V4),]
wV8 <- table(OzV4NA$V8 < 67.5)
sum(wV8 * tapply(OzV4NA$V4, OzV4NA$V8 < 67.5, var, na.rm = TRUE)) / sum(wV8)
[1] 30.48036
wV12 <- table(OzV4NA$V12 < 63.59)
sum(wV12 * tapply(OzV4NA$V4, OzV4NA$V12 < 63.59, var, na.rm = TRUE)) / sum(wV12)
[1] 35.96778
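As announced above, the “improve” value of a split can (almost) be recovered from the deviances printed by Tree1: it is the decrease in deviance brought by the split, relative to the deviance of the parent node. The result also matches the first CP value of the cptable (0.5157568); the small difference with the printed improve = 0.5147 presumably comes from the way rpart handles the 2 observations with a missing V8 when evaluating the split.

# deviance decrease of the root split, from the Tree1 print:
# dev(root) = 22558, dev(left) = 4416.069, dev(right) = 6507.488
(22558 - (4416.069 + 6507.488)) / 22558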

2.3 Surrogate splits

Still in the display obtained with summary(Tree1), surrogate splits are listed. These splits are ranked in descending order of the “agree” quantity (for “agreement”), which represents the degree of agreement with the optimal split: the higher a split’s “agree” value, the more it sends roughly the same observations to the left and right child nodes as the optimal split.

  1. Construct the cross-table of the observations of the Ozone data sent to the left or to the right by the optimal split and by the first surrogate split.
    [Remove missing data from the V4 variable beforehand]
OzV4NA <- Ozone[!is.na(Ozone$V4),]
agreement <- table(OzV4NA$V8 < 67.5, OzV4NA$V12 < 67.19)
agreement
       
        FALSE TRUE
  FALSE   102   20
  TRUE     25  198
  2. Compute the ratio between the number of agreements and 359.
sum(diag(agreement)) / 359
[1] 0.8356546
  3. Determine why it was necessary to divide by 359 in the previous question.
# We compute the ratio between the number of observations actually sent
# to the same child nodes and the number of non-missing observations of
# the variable associated with the optimal split, here V8 (361 - 2 = 359).
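The same computation can be tried, as a sketch, on the second surrogate split “V5 < 5795” given by summary(Tree1); up to possible missing values in V5, the ratio should be close to the printed agree = 0.813:

agreement_V5 <- table(OzV4NA$V8 < 67.5, OzV4NA$V5 < 5795)
sum(diag(agreement_V5)) / 359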

2.4 Prediction when values are missing

At each node, surrogate splits are therefore computed. These surrogate splits are used to predict an observation that has a missing value: if the value of the optimal split variable is missing, the surrogate split with the highest “agree” value is used; if it is also missing, the next surrogate split is used, and so on.

  1. For the two-leaves tree, determine in which leaf the observation on line 1 of the Ozone data falls. Explain the steps performed.
head(Ozone)
# The variable V8 is missing for this observation, so we look at the
# value of V12 and compare it to 67.19, because "V12 < 67.19" is the 1st
# surrogate split. Since 30.56 is less than 67.19, the observation
# goes to the left child node. We can check this with:
Tree1$where[1]
1 
2 
# "2" stands for the left child node of the root node.
  2. Build the tree of depth 2 (with 4 leaves), then determine in which leaf the observation on line 2 of the Ozone data falls, explaining the path taken to reach this leaf.
Tree4l <- rpart(V4 ~ ., data = Ozone, maxdepth = 2)
Tree4l
n=361 (5 observations deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

1) root 361 22558.0000 11.526320  
  2) V8< 67.5 232  4416.0690  7.293103  
    4) V10>=3573.5 115   721.4435  5.069565 *
    5) V10< 3573.5 117  2567.1970  9.478632 *
  3) V8>=67.5 129  6507.4880 19.139530  
    6) V8< 79.5 85  3320.8940 16.564710 *
    7) V8>=79.5 44  1534.4320 24.113640 *
summary(Tree4l)
Call:
rpart(formula = V4 ~ ., data = Ozone, maxdepth = 2)
  n=361 (5 observations deleted due to missingness)

          CP nsplit rel error    xerror       xstd
1 0.51575683      0 1.0000000 1.0037258 0.07423753
2 0.07324064      1 0.4842432 0.5503434 0.04538689
3 0.04997912      2 0.4110025 0.4824789 0.04207148
4 0.01000000      3 0.3610234 0.4330381 0.03966768

Variable importance
 V8 V12  V5  V1 V10  V7  V2 V13 
 36  21  17  15   5   2   2   1 

Node number 1: 361 observations,    complexity param=0.5157568
  mean=11.52632, MSE=62.48753 
  left son=2 (232 obs) right son=3 (129 obs)
  Primary splits:
      V8  < 67.5   to the left,    improve=0.5146929, (2 missing)
      V12 < 63.59  to the left,    improve=0.4221966, (14 missing)
      V10 < 3472.5 to the right,   improve=0.3188287, (15 missing)
      V9  < 62.87  to the left,    improve=0.3174874, (137 missing)
      V1  splits as  LLLLRRRRRRLL, improve=0.3079652, (0 missing)
  Surrogate splits:
      V12 < 67.19  to the left,    agree=0.836, adj=0.543, (2 split)
      V5  < 5795   to the left,    agree=0.813, adj=0.481, (0 split)
      V1  splits as  LLLLRRRRRLLL, agree=0.808, adj=0.465, (0 split)
      V7  < 70.5   to the left,    agree=0.669, adj=0.078, (0 split)
      V10 < 2396   to the right,   agree=0.666, adj=0.070, (0 split)

Node number 2: 232 observations,    complexity param=0.04997912
  mean=7.293103, MSE=19.03478 
  left son=4 (115 obs) right son=5 (117 obs)
  Primary splits:
      V10 < 3573.5 to the right,   improve=0.2581106, (8 missing)
      V12 < 53.15  to the left,    improve=0.2442090, (7 missing)
      V8  < 58.5   to the left,    improve=0.2128880, (2 missing)
      V1  splits as  LRRRRR-RLRRL, improve=0.1218698, (0 missing)
      V9  < 56.21  to the left,    improve=0.1146961, (80 missing)
  Surrogate splits:
      V12 < 52.79  to the left,  agree=0.875, adj=0.745, (1 split)
      V5  < 5675   to the left,  agree=0.683, adj=0.355, (7 split)
      V2  splits as  LLLRLLRLLRLLRLLLLRRRLLLRRLLLRRR, agree=0.643, adj=0.273, (0 split)
      V8  < 56.5   to the left,  agree=0.638, adj=0.264, (0 split)
      V13 < 65     to the right, agree=0.634, adj=0.255, (0 split)

Node number 3: 129 observations,    complexity param=0.07324064
  mean=19.13953, MSE=50.44565 
  left son=6 (85 obs) right son=7 (44 obs)
  Primary splits:
      V8  < 79.5   to the left,  improve=0.2538863, (0 missing)
      V12 < 72.77  to the left,  improve=0.2307975, (7 missing)
      V9  < 70.97  to the left,  improve=0.1900417, (57 missing)
      V13 < 145    to the right, improve=0.1851993, (0 missing)
      V11 < -25.5  to the left,  improve=0.1775629, (0 missing)
  Surrogate splits:
      V12 < 75.65  to the left,  agree=0.798, adj=0.409, (0 split)
      V2  splits as  LLLLLLLLLLLLLLRLLLRLLLLRRRRRRLL, agree=0.775, adj=0.341, (0 split)
      V5  < 5865   to the left,  agree=0.775, adj=0.341, (0 split)
      V1  splits as  L-LLLLRLLLL-, agree=0.682, adj=0.068, (0 split)
      V3  splits as  LLRLLLL, agree=0.674, adj=0.045, (0 split)

Node number 4: 115 observations
  mean=5.069565, MSE=6.273422 

Node number 5: 117 observations
  mean=9.478632, MSE=21.94185 

Node number 6: 85 observations
  mean=16.56471, MSE=39.06934 

Node number 7: 44 observations
  mean=24.11364, MSE=34.87345 
head(Ozone)
# At the first node (the root), we have a value for V8 (38), which sends
# the observation to the left child node (node number 2).
# Arriving at node 2, as the value of V10 is missing, we must look at the
# surrogate splits. The value of V12 is also missing, but the value of V5
# is not. We have 5660 < 5675, so the observation finally goes to the
# left child node of node 2, i.e. node 4.
Tree4l$where[2]
2 
3 
# Here the result is 3, because it is the row number in the frame table:
Tree4l$frame
# Row 3 contains node 4 (which is a leaf), while node 3, the right child
# node of the root, is on row 5.
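The rpart package also provides the path.rpart() function, which prints the sequence of splits leading to a given node; it is a quick way to double-check the path described above (node numbers are those of the Tree4l print):

path.rpart(Tree4l, nodes = 4)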

2.5 Variable importance

Surrogate splits are also used to compute variable importance scores. For a given variable \(X^j\), we sum all the homogeneity gains of the nodes where \(X^j\) is the optimal split variable and, for the nodes where this is not the case, the homogeneity gains associated with the surrogate splits using \(X^j\).

  1. Build the maximal tree, then display the variable importance scores.
arbre_max <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0)
imp_cart <- arbre_max$variable.importance
imp_cart
        V8        V12         V1         V5         V2        V10        V11 
14555.1689  8948.3863  8010.5991  7723.9996  4884.6418  2529.3148  1875.4416 
        V7         V3        V13         V6         V9 
 1590.6231  1159.2380  1122.9561   604.1321   154.4161 
  2. Propose a graphical representation of these variable importance scores.
barplot(imp_cart)

  3. Repeat the two previous questions, removing the missing data when building the maximal tree. Comment, explaining what happens with the variable V9.
arbre_max_naomit <- rpart(V4 ~ ., data = Ozone, minsplit = 2, cp = 0, na.action = na.omit)
imp_cart_naomit <- arbre_max_naomit$variable.importance
barplot(imp_cart_naomit)

# V9 is now in 1st position for the CART tree. Recall that this variable
# contains a large number of missing values.
# In conclusion, the variable importance computed by CART gives results
# consistent with the application here. In any case, it is much more
# satisfactory than considering that the most important variables are
# "only" those that appear in the splits of the optimal tree.
# The only drawback of this importance measure is that, as we see here
# for V2, it tends to inflate the importance of qualitative variables
# with a large number of levels.
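Finally, note that summary() displays variable importance rescaled so that the scores sum to 100; as a sketch, the same scaling can be applied to the raw scores of the two maximal trees to compare their rankings on a common scale:

round(100 * imp_cart / sum(imp_cart))
round(100 * imp_cart_naomit / sum(imp_cart_naomit))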