The package for working with decision trees is called rpart (short for "recursive partitioning"), and its function for constructing trees is called rpart(). To install the rpart package, click Install on the Packages tab and type rpart in the Install Packages dialog box. Then, in the dialog box, click the Install button. After the package downloads, find rpart in the Packages tab and click to select its check box.
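If you'd rather skip the point-and-click route, the console equivalents are the standard install.packages() and library() calls (a quick sketch with the same package name):

install.packages("rpart")  # same as clicking Install in the dialog box
library(rpart)             # same as selecting the rpart check box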
Growing the tree in R
To create a decision tree for the iris.uci data frame, use the following code:
library(rpart)
iris.tree <- rpart(species ~ sepal.length + sepal.width + petal.length + petal.width,
                   iris.uci, method = "class")

The first argument to rpart() is a formula indicating that species depends on the other four variables. [The tilde (~) means "depends on."] The second argument is the data frame you're using. The method = "class" argument (it's the third one) tells rpart() that this is a classification tree. (For a regression tree, it's method = "anova".)
You can abbreviate the whole right side of the formula with a period. So the shorthand version is species ~ .
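Under that shorthand, the call below (a sketch, assuming iris.uci contains only species and the four measurement columns) grows the same tree:

iris.tree <- rpart(species ~ ., iris.uci, method = "class")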
The left side of the code, iris.tree, is called an rpart object. So rpart() creates an rpart object.
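If you want to verify that, ask R for the object's class:

class(iris.tree)  # "rpart"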
At this point, you can type the rpart object iris.tree and see text output that describes the tree:
n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
  2) petal.length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
  3) petal.length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) petal.width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) petal.width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *

The first line indicates that this tree is based on 150 cases. The second line provides a key for understanding the output. The third line tells you that an asterisk denotes that a node is a leaf.
Each row corresponds to a node on the tree. The first entry in the row is the node number, followed by a right parenthesis. The second is the variable and the value that make up the split. The third is the number of cases classified at that node. The fourth, loss, is the number of misclassified cases at the node. Misclassified? Compared to what? Compared to the next entry, yval, which is the tree's best guess of the species at that node. The final entry is a parenthesized set of values giving the proportion of each species at the node.
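To see where a row's numbers come from, you can rebuild node 6 by hand. A minimal sketch, assuming the lowercase column names used earlier and that species is a factor:

node6 <- subset(iris.uci, petal.length >= 2.45 & petal.width < 1.75)
nrow(node6)                       # n: 54 cases reach node 6
table(node6$species)              # 49 versicolor and 5 virginica, so loss = 5
prop.table(table(node6$species))  # the yprob entries: 0.000, 0.907, 0.093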
You can see the perfect classification in node 2, where loss (misclassification) is 0. By contrast, in nodes 6 and 7, loss is not 0. Also, unlike node 2, the parenthesized proportions for nodes 6 and 7 do not show 1.00 in the slots that represent the correct species. So the classification rules for versicolor and virginica result in small amounts of error.
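A confusion matrix tallies those errors for the whole tree. A minimal sketch, assuming the iris.tree object built earlier:

pred <- predict(iris.tree, type = "class")          # the tree's class for each case
table(predicted = pred, actual = iris.uci$species)  # rows: predicted, columns: actual

The off-diagonal cells match the loss values: the 5 virginicas that node 6 calls versicolor, and the 1 versicolor that node 7 calls virginica.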
Drawing the tree in R
Now you plot the decision tree, and you can see how it corresponds to the rpart() output. You do this with a function called prp(), which lives in the rpart.plot package.
The rpart package has a function called plot.rpart(), which is supposed to plot a decision tree. You might find that your version of R can't find it: it can find the function's documentation via ?plot.rpart, but it can't find the function itself. Weird.
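One likely explanation, and a workaround: plot.rpart() is an S3 method, so you normally call it through the generic rather than by its full name. A minimal sketch:

plot(iris.tree)  # dispatches to plot.rpart() and draws the bare branches
text(iris.tree)  # dispatches to text.rpart() and adds the split labels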
With rpart.plot installed, here's the code that plots the tree shown below:

library(rpart.plot)
prp(iris.tree, type = 2, extra = "auto", nn = TRUE, branch = 1, varlen = 0, yesno = 2)
The first argument to prp() is the rpart object. That's the only argument that's necessary (see the sketch after this list). Think of the rpart object as a set of specifications for plotting the tree. You can add the other arguments to make the plot prettier:
- type = 2 means "label all the nodes."
- extra = "auto" tells prp() to include the information you see in each rounded rectangle that's in addition to the species name.
- nn = TRUE puts the node number on each node.
- branch = 1 indicates the lines-with-corners style of branching. These are called "square-shouldered branches," believe it or not. For slump-shouldered branches (I made that up), try a value between 0 and 1.
- varlen = 0 produces the full variable names on all the nodes (instead of names truncated to 8 characters).
- yesno = 2 puts yes or no on all the appropriate branches (instead of just the ones descending from the root, which is the default). Note that each left branch is yes and each right branch is no.
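Since the rpart object is the only required argument, the bare call works on its own. Here's a quick sketch contrasting it with the slump-shouldered branch style mentioned in the list:

prp(iris.tree)                # the bare-bones plot, all defaults
prp(iris.tree, branch = 0.5)  # partway between slump- and square-shouldered branches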
At the root, the proportions are .33 for each species, and 100 percent of the data is at the root. The split (petal.length < 2.4) puts 33 percent of the data at the setosa leaf and 67 percent at the internal node. The setosa leaf shows the proportions 1.00, .00, and .00, indicating that all the cases at that leaf are perfectly classified as setosas.
The internal node shows .00, .50, and .50, which means none of these cases are setosas, half are versicolor, and half are virginica. The internal node split (petal.width < 1.8) puts 36 percent of the cases into the versicolor leaf and 31 percent of the cases into the virginica leaf. (Those figures come from the counts in the text output: 54/150 is 36 percent, and 46/150 is about 31 percent.) Already this shows a problem: With perfect classification, those percentages would be equal, because each species shows up equally in the data.
On the versicolor leaf, the proportions are .00, .91, and .09. This means 9 percent of cases classified as versicolor are actually virginica. On the virginica leaf, the proportions are .00, .02, and .98. So 2 percent of the cases classified as virginica are really versicolor.
Bottom line: For the great majority of the 150 cases in the data, the classification rules in the decision tree get the job done. But the rules aren’t perfect, which is typically the case with a decision tree.
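If you want to put a single number on that imperfection, the training accuracy is quick to compute; a sketch reusing the fitted tree from earlier:

pred <- predict(iris.tree, type = "class")  # fitted class for each of the 150 cases
mean(pred == iris.uci$species)              # 144/150 = 0.96, i.e., 4 percent misclassified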