--- title: "CART practice" author: "Mathias Bourel" output: html_document: default pdf_document: default --- Install and load rpart package. Another package to make tree is the tree package. Install partykit package (this package makes better picture and have a more suitable interfase) Look at rpart.control to see the different parameters: ```{r} rm(list=ls()) library(rpart) library(partykit) ?rpart ?rpart.control #rpart.control(minsplit = 20, minbucket = round(minsplit/3), cp = 0.01, # maxcompete = 4, maxsurrogate = 5, usesurrogate = 2, xval = 10, # surrogatestyle = 0, maxdepth = 30) ``` ##CLASSIFICATION TREES We use Iris data ```{r} attach(iris) summary(iris) ``` #CART ```{r} model=rpart(Species~.,data=iris) model ``` Take the time to try to understand what is displayed. ```{r} plot(model,margin=0.1) text(model) model.cl=rpart(Species~.,cp=0.001,minsplit=5,iris) summary(model.cl) model.cl ``` Take the time to try to understand what is displayed. rpart uses a default cp value of 0.01 if you don't specify one in prune. ```{r} x11() par(mfrow=c(3,2)) plot(model.cl,margin=0.1) text(model.cl,all=T) plot(model.cl, uniform=T) text(model.cl, use.n=T, all=T) plot(model.cl, branch=0) text(model.cl, use.n=T,col="red") plot(model.cl, branch=.7) text(model.cl, use.n=T) plot(model.cl, branch=.4, uniform=T, compress=T) text(model.cl, all=T,use.n=T) plot(model.cl, branch=.2, uniform=T, compress=T, margin=.1) text(model.cl, all=T, use.n=T, fancy=T) ``` try to understand the different graphical parameters another way to see the output ```{r} rparty.tree = as.party(model.cl) plot(rparty.tree) ``` Classification error over the train sample ```{r} pred=predict(model.cl,type="class",iris) table(pred,true=iris[,5]) error=3/150 error #or error = mean(pred!= iris[,5]) error ``` Classification error over newdata ```{r} newdata = rbind(c(5,3.45,1,0.2),c(5.8,3.5,5.11,2)) dimnames(newdata)=list(NULL, c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")) newdatas=data.frame(newdata) pred=predict(model.cl,type="class",newdata=newdatas) pred pred=predict(model.cl,type="class",newdata=newdatas) pred pred=predict(model.cl,type="prob",newdata=newdatas) pred pred=predict(model.cl,type="vector",newdata=newdatas) pred ``` Classification error over a test sample Divide the dataset 30 times randomly in train and test samples. For each split, fit a classification tree with the train sample and compute its prediction over the test sample. At the end, compute the mean, with standard deviation, of these errors. ```{r} K=30 error.cl=NULL n = nrow(iris) for(k in 1:K) { smp=sample(n,round(n/3)) learn=iris[-smp,] test=iris[smp,] model.cl.learn=rpart(Species~.,cp=0.001,minsplit=5,data=learn) pred.cl=predict(model.cl.learn,type="class",test) error.cl[k] = mean(pred.cl!= test[,5]) } mean.error.cl=mean(error.cl) sd.error.cl=sd(error.cl) mean.error.cl sd.error.cl ``` Pruning ```{r} plotcp(model.cl) printcp(model.cl) ``` According to the criterion of least error by cross validation we are left with the tree 5. According to criterion 1-SE: with what tree do we stay? 1-SE Rule: We are left with the simplest tree that has a minor error at the lowest error by VC + 1se (xerror + xstd). BE CAREFUL: THESE RESULTS MAY VARY FOR THE DIFFERENT STEPS BECAUSE WHEN CALCULATING THE ERROR VALIDATION CROSSED, WE ARE SEPARATING THE SAMPLE RANDOMLY AND IT CAN BRING DIFFERENT RESULTS. 
Let us see (as an exercise) what happens if we choose to prune the tree, keeping the subtree corresponding to cp = 0.010.

```{r}
model.cl.pod=prune(model.cl,cp=0.010)
plot(model.cl.pod,margin=0.1)
text(model.cl.pod, use.n=T,col="blue")
```

Classification error of the pruned tree:

```{r}
K=30
error.cl.pod=NULL
n = nrow(iris)
for(k in 1:K)
{
  smp=sample(n,round(n/3))
  learn=iris[-smp,]
  test=iris[smp,]
  model.cl.pod.learn=rpart(Species~.,cp=0.010,data=learn)
  pred.cl.pod.learn=predict(model.cl.pod.learn,type="class",newdata=test)
  error.cl.pod[k] = mean(pred.cl.pod.learn != test[,5])
}
mean.error.cl.pod=mean(error.cl.pod)
sd.error.cl.pod=sd(error.cl.pod)
mean.error.cl.pod
sd.error.cl.pod
mean.error.cl
sd.error.cl
```

## 2- Regression trees

We use the airquality data.

```{r}
data=airquality
attach(data)
model.rg=rpart(Ozone~Solar.R+Wind+Temp,data=data)
model.rg
summary(model.rg)
plot(model.rg,margin=0.1)
text(model.rg, use.n=T)
```

Compute the pseudo-R2: (deviance of the root node - deviance of the tree) / deviance of the root node. Where do you read the deviance for model.rg? (A sketch of this computation is given at the end of this document.)

Predictions:

```{r}
predict(model.rg)
newdata = rbind(c(315,12,60),c(270,9,65))
dimnames(newdata)=list(NULL, c("Solar.R","Wind","Temp"))
newdata=data.frame(newdata)
pred=predict(model.rg,newdata)
pred
```

Keep the tree according to the cp criterion (cross-validation and the 1-SE rule).

```{r}
printcp(model.rg)
plotcp(model.rg)
```

Here are some ways to select cp:

```{r}
# With cross-validation
cp.optx = model.rg$cptable[which.min(model.rg$cptable[,"xerror"]),"CP"]
cp.optx

# With the 1-SE rule
xerror=model.rg$cptable[,4]
xstd <- model.rg$cptable[, 5]
t.opt <- min(seq(along = xerror)[xerror <= min(xerror) + xstd[which.min(xerror)]])
cp.opt1SE=model.rg$cptable[t.opt,1]
cp.opt1SE

# We keep the tree which satisfies the 1-SE rule
model.rg.pr=prune(model.rg,cp=cp.opt1SE)
plot(model.rg.pr,branch=0.1,margin=0.1)
text(model.rg.pr, use.n=T,col="orange")
```

Prediction error (RMSE) over test samples:

```{r}
K=30
error.rg.pr=NULL
n = nrow(data)
for(k in 1:K)
{
  smp=sample(n,round(n/3))
  learn=data[-smp,]
  learn=learn[,1:4]
  test=data[smp,]
  test=test[,1:4]
  model.rg.pr.learn=rpart(Ozone~Solar.R+Wind+Temp,data=learn)
  pred=predict(model.rg.pr.learn,newdata=test)
  # drop the rows where the observed Ozone is NA before computing the RMSE
  sin.na=na.omit(cbind(test[,1],pred))
  error.rg.pr[k] = sqrt(mean((sin.na[,1]-sin.na[,2])^2))
}
mean.error.rg.pr=mean(error.rg.pr)
sd.error.rg.pr=sd(error.rg.pr)
mean.error.rg.pr
sd.error.rg.pr
```

The final model is the one built over all the data!
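As a minimal sketch of that last remark (the names model.final, fr, dev.root, dev.tree and pseudo.R2 are only for illustration, and the result depends on the cp.opt1SE value obtained above), we can refit the regression tree over all of airquality with the selected cp and compute the pseudo-R2 asked for earlier:

```{r}
# Sketch: refit over all the data with the cp chosen by the 1-SE rule
model.final = rpart(Ozone~Solar.R+Wind+Temp, data=airquality, cp=cp.opt1SE)
model.final

# Pseudo-R2 = (deviance of the root node - deviance of the tree) / deviance of the root node.
# The deviances can be read in model.final$frame: the first row is the root node,
# and the deviance of the tree is the sum of dev over the terminal nodes ("<leaf>").
fr = model.final$frame
dev.root = fr$dev[1]
dev.tree = sum(fr$dev[fr$var=="<leaf>"])
pseudo.R2 = (dev.root - dev.tree)/dev.root
pseudo.R2
# Equivalently, 1 minus the "rel error" of the last row printed by printcp(model.final)
```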