Using codes for data reduces data entry time, helps prevent entry errors, and reduces the memory requirements for storing the data. But the codes aren’t meaningful unless you have documentation, or labels, to explain their meaning.
Some data formats enable you to enjoy the advantages of using codes while keeping the information about the meaning of the codes in the same file. These aren’t typical in data mining — you’re more likely to see them in statistical analysis products — but some data-mining applications can use these labeled data formats. Here’s how they work.
Data appears to contain only numbers, but these numbers are codes for values of categorical variables.
This dataset is open in the PSPP statistical analysis application.
The same dataset with labels instead of numeric codes.
You can switch back and forth between these two display options using the menu.
Although the data is stored as numbers, the labels allow you to see what the data means.
In the figure, you are looking at the dataset in the PSPP data editor. From the same application, you can also set up an analysis or view the results.
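The idea of storing codes while displaying labels can be sketched in a few lines of Python. This is only an illustration, not how PSPP works internally: the variable name `REGION_LABELS`, the helper `apply_labels`, and the codes and labels themselves are all hypothetical.

```python
# Hypothetical value labels for a categorical variable.
# The data stores only the numeric codes; the labels live alongside them.
REGION_LABELS = {1: "North", 2: "South", 3: "East", 4: "West"}

# The stored data: compact numeric codes (made-up values for illustration).
codes = [1, 3, 3, 2, 4, 1]


def apply_labels(codes, labels):
    """Return the label for each code, flagging any code without a label."""
    return [labels.get(code, f"<unlabeled {code}>") for code in codes]


# Switching from the "codes" view to the "labels" view of the same data.
labeled = apply_labels(codes, REGION_LABELS)
print(labeled)
```

Because the stored values never change, you can switch between the two views at any time; only the display differs.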
You can include comments in a dataset.
You may also find other types of data labels in data-mining applications. The native data format for Weka (ARFF) allows you to include comments in a dataset. This gives you a good place to put annotations about the source of the data and other important details.
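As a rough sketch of what such comments look like: in Weka's ARFF format, a line that begins with `%` is a comment. The small dataset below is invented for illustration, and the helper `split_comments` is a hypothetical name, not part of Weka.

```python
# A tiny, made-up ARFF file. Lines starting with "%" are comments,
# used here to record the (hypothetical) source and the meaning of the codes.
ARFF_TEXT = """\
% Source: hypothetical customer survey (illustrative only)
% Codes: region 1=North, 2=South
@relation survey

@attribute region {1,2}
@attribute age numeric

@data
1,34
2,51
"""


def split_comments(arff_text):
    """Separate full-line comments from the rest of an ARFF file's lines."""
    comments, content = [], []
    for line in arff_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("%"):
            # Drop the leading "%" and surrounding spaces.
            comments.append(stripped.lstrip("%").strip())
        elif stripped:
            content.append(stripped)
    return comments, content


comments, content = split_comments(ARFF_TEXT)
print(comments)
```

Keeping the annotations in the file itself means the documentation travels with the data, so anyone who opens the file can see where it came from.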
You can annotate data.
RapidMiner also has an option for annotations. You can use the graphical user interface to enter annotations for individual rows of data.