The UCI ML repository has a dataset of mushrooms with 8,124 instances and 22 attributes. The target variable indicates whether the mushroom is edible (e) or poisonous (p).
You create an R data frame by navigating to the Data Folder, finding the .csv data file, and then pressing Ctrl+A to select all the data and Ctrl+C to copy it to the clipboard. Then this line does the trick:
mushroom.uci <- read.csv("clipboard", header=FALSE)
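If copying through the clipboard is awkward on your machine, the same call should work on a saved copy of the file. Here's a minimal sketch of that idea. The temporary file and its three made-up rows stand in for the real download, and the names demo.file and mushroom.demo are just for illustration:

```r
# Stand-in for the downloaded file: three made-up rows, no header line
demo.file <- tempfile(fileext = ".csv")
writeLines(c("p,x,s,n", "e,x,s,y", "e,b,s,w"), demo.file)

# Same call as before, with a file path in place of "clipboard"
mushroom.demo <- read.csv(demo.file, header = FALSE)
dim(mushroom.demo)   # number of rows, then number of columns
```

With the real file, you'd replace demo.file with the path to wherever you saved the data.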
A word of advice: The attribute names are long and involved, so for this project only, don’t bother naming the columns unless you really and truly want to. Instead, use the default V1, V2, and so on that R provides. Also, and this is important, after you put the data into Rattle, you’ll see that Rattle makes a guess about the target variable. Its guess, V23, is wrong. The real target variable is V1. So click the appropriate radio buttons to make the changes.
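Before you hand the data frame to Rattle, a quick look at the column names and at V1 can confirm what you're about to select. This sketch uses four made-up rows rather than the real data, and target.demo is a hypothetical name:

```r
# Made-up rows in the same layout: class label first, attributes after
target.demo <- read.csv(text = "p,x,s\ne,x,y\ne,b,s\np,f,y", header = FALSE)

names(target.demo)      # R's default names: V1, V2, V3
table(target.demo$V1)   # counts of e (edible) and p (poisonous)
```

Running the same two checks on mushroom.uci should show the e and p labels sitting in V1, which is why V1 is the column to mark as Target.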
One of the attributes, V12, has missing values, which the data file marks with a question mark. To deal with them, select the Rattle Transform tab and click the radio button for Impute and the radio button for Zero/Missing. Click V12 and then Execute. This substitutes Missing for the question mark. (Spoiler alert: With this data frame, it doesn’t make much difference whether you do this or not.)
When you create the forest, you should have a confusion matrix with just two rows and two columns. You’ll be pleasantly surprised by the OOB error rate!
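If you'd like to see roughly what these steps amount to in plain R, here's a hedged sketch. It assumes the randomForest package is installed, and it uses a small synthetic data frame (forest.demo, with made-up values and a made-up rule for the target) instead of the real mushrooms, so the numbers it produces say nothing about the actual data. It also isn't necessarily what Rattle does behind the scenes:

```r
library(randomForest)   # assumed to be installed

set.seed(1)
n <- 200
# Synthetic stand-in: V12 uses "?" for unknown values, as in the text
forest.demo <- data.frame(
  V2  = sample(c("a", "b", "c"), n, replace = TRUE),
  V12 = sample(c("e", "c", "?"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
# Made-up rule so the forest has something to learn
forest.demo$V1 <- ifelse(forest.demo$V2 == "a", "p", "e")

# Swap the question mark for an explicit "Missing" category
forest.demo$V12[forest.demo$V12 == "?"] <- "Missing"

# Convert every column to a factor so randomForest does classification
forest.demo[] <- lapply(forest.demo, factor)

rf <- randomForest(V1 ~ ., data = forest.demo, ntree = 100)
rf$confusion                   # two rows (e, p); class columns plus class.error
rf$err.rate[rf$ntree, "OOB"]   # OOB error rate after the last tree
```

Note that rf$confusion tacks a class.error column onto the two class columns, so the two-by-two part is the first two columns.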