Another task in predictive analytics is to classify new data by predicting what class a target item of data belongs to, given a set of independent variables. You can, for example, classify a customer by type – say, as a high-value customer, a regular customer, or a customer who is ready to switch to a competitor – by using a decision tree.
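The examples that follow assume you've already trained a conditional inference tree on the seeds dataset and split the data into a training set and a test set. As a point of reference, here's a minimal sketch of how such a model could be built with the party package; the seedsData data frame name, the split proportions, and the seed value are illustrative assumptions, not the original code:

> # A sketch, assuming the seeds data lives in a data frame named
> # seedsData (hypothetical name) with a seedType factor column
> library(party)                            # provides ctree()
> set.seed(42)                              # make the random split reproducible
> trainIndex <- sample(nrow(seedsData), 0.7 * nrow(seedsData))
> trainSet <- seedsData[trainIndex, ]       # about 70 percent for training
> testSet <- seedsData[-trainIndex, ]       # the rest held out for testing
> model <- ctree(seedType ~ ., data=trainSet)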
To see some useful information about the classification model, type in the following code:
> summary(model)
    Length      Class       Mode 
         1 BinaryTree         S4 
The Class column tells you that you’ve created a decision tree. To see how the splits are being determined, you can simply type in the name of the variable to which you assigned the model, in this case model, like this:
> model

	 Conditional inference tree with 6 terminal nodes

Response:  seedType 
Inputs:  area, perimeter, compactness, length, width, asymmetry, length2 
Number of observations:  147 

1) area <= 16.2; criterion = 1, statistic = 123.423
  2) area <= 13.37; criterion = 1, statistic = 63.549
    3) length2 <= 4.914; criterion = 1, statistic = 22.251
      4)*  weights = 11 
    3) length2 > 4.914
      5)*  weights = 45 
  2) area > 13.37
    6) length2 <= 5.396; criterion = 1, statistic = 16.31
      7)*  weights = 33 
    6) length2 > 5.396
      8)*  weights = 8 
1) area > 16.2
  9) length2 <= 5.877; criterion = 0.979, statistic = 8.764
    10)*  weights = 10 
  9) length2 > 5.877
    11)*  weights = 40 
Even better, you can visualize the model by creating a plot of the decision tree with this code:
> plot(model)
This is a graphical representation of a decision tree. You can see that the overall shape mimics that of a real tree. It’s made of nodes (the circles and rectangles) and links or edges (the connecting lines).
The very first node (starting at the top) is called the root node and the nodes at the bottom of the tree (rectangles) are called terminal nodes. There are five decision nodes and six terminal nodes.
At each node, the model makes a decision based on the criteria in the circle and the links, and chooses a way to go. When the model reaches a terminal node, a verdict or a final decision is reached. In this particular case, two attributes, the area and the length2, are used to decide whether a given seed is of type 1, 2, or 3.
For example, take observation #2 from the dataset. It has a length2 of 4.956 and an area of 14.88. You can use the tree you just built to decide which particular seed type this observation belongs to. Here’s the sequence of steps:
Start at the root node, which is node 1 (the number is shown in the small square at the top of the circle). Decide based on the area attribute: Is the area of observation #2 less than or equal to (denoted by <=) 16.2? The answer is yes, so move along the path to node 2.
At node 2, the model asks: Is the area <= 13.37? The answer is no, so try the next link, which asks: Is the area > 13.37? The answer is yes, so move along the path to node 6. At this node the model asks: Is the length2 <= 5.396? It is, so you move to terminal node 7, where the verdict is that observation #2 is of seed type 1. And it is, in fact, seed type 1.
The model repeats this process for every other observation to predict its class.
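You can also verify a single observation with predict(). For instance, a quick sketch that asks the model to classify observation #2 from the training set on its own:

> # Predict the class of observation #2 alone; predict() on a
> # ctree model returns the predicted factor level
> predict(model, newdata=trainSet[2, ])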
To find out whether you trained a good model, check it against the training data. You can view the results in a table with the following code:
> table(predict(model), trainSet$seedType)
   
     1  2  3
  1 45  4  3
  2  3 47  0
  3  1  0 44
The results show that the error (or misclassification rate) is 11 out of 147, or 7.48 percent.
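Rather than counting the off-diagonal cells by hand, you can let R do the arithmetic. A small sketch: the diagonal of the table holds the correct predictions, so subtract their share of the total from 1:

> # Misclassification rate = 1 - (correct predictions / total observations)
> trainTable <- table(predict(model), trainSet$seedType)
> 1 - sum(diag(trainTable)) / sum(trainTable)   # 11/147, about 0.0748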
With the results calculated, the next step is to read the table.
The correct predictions are the ones where the row (the predicted class) and the column (the actual class) are the same. Those results show up as a diagonal line from top-left to bottom-right; for example, cells [1,1], [2,2], and [3,3] hold the number of correct predictions for each class.
So for seed type 1, the model made 45 correct predictions, while 4 seeds of type 2 and 3 seeds of type 3 were incorrectly predicted as type 1. For seed type 2, the model made 47 correct predictions, while 3 seeds of type 1 were incorrectly predicted as type 2. For seed type 3, the model made 44 correct predictions, while only 1 seed of type 1 was incorrectly predicted as type 3.
These results indicate a good model, so the next step is to evaluate it with the test data. Here is the code that makes predictions on the test data and stores them in a variable (testPrediction) for later use:
> testPrediction <- predict(model, newdata=testSet)
To evaluate how the model performed on the test data, view the predictions in a table and calculate the error. The code looks like this:
> table(testPrediction, testSet$seedType)
              
testPrediction  1  2  3
             1 23  2  1
             2  1 19  0
             3  1  0 17
The results show that the error is 5 out of 64, or 7.81 percent. This is consistent with the training data.
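As before, you can compute that test error directly from the confusion table instead of counting cells by hand:

> # Test-set misclassification rate, computed the same way
> testTable <- table(testPrediction, testSet$seedType)
> 1 - sum(diag(testTable)) / sum(testTable)     # 5/64, about 0.0781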