To test the predictive model you've built, you need to split your dataset into two sets: a training dataset and a test dataset. These datasets should be selected at random and should be a good representation of the actual population.
The training and test datasets should be drawn from the same data so that they have similar characteristics.
Normally the training dataset is significantly larger than the test dataset.
Holding out a test dataset helps you catch problems such as overfitting.
The trained model is run against test data to see how well the model will perform.
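To see what this looks like in practice, here's a minimal sketch in Python using scikit-learn. The synthetic feature matrix X and labels y are stand-ins for your own data, and the 75/25 split proportion is an illustrative assumption, not a rule from the text.

```python
# Sketch: a random train/test split with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data so the example runs on its own; replace with your dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out 25% of the rows, chosen at random, as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Train on the training set only...
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# ...then check how well the model does on data it has never seen.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```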
Some data scientists prefer to have a third dataset that has characteristics similar to those of the first two: a validation dataset. The idea is that if you’re actively using your test data to refine your model, you should use a separate (third) set to check the accuracy of the model.
Having a validation dataset that wasn't used as part of the development process of your model helps ensure an unbiased estimate of the model's accuracy and efficacy.
If you’ve built multiple models using various algorithms, the validation sample can also help you evaluate which model performs best.
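If you do want that third, untouched dataset, one common approach is to split the data twice. The sketch below is only illustrative: the 60/20/20 proportions and the synthetic data are assumptions, not something prescribed here.

```python
# Sketch: carving the data into training, test, and validation sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First set aside 20% as the validation set; it stays untouched until the end.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# Then split the remainder into training (60% overall) and test (20% overall).
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

# Build the model on X_train/y_train, refine it against X_test/y_test,
# and score X_val/y_val only once for a neutral estimate of accuracy.
```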
Make sure you double-check your work as you develop and test the model. In particular, be skeptical if the performance or the accuracy of the model seems too good to be true. Errors can happen where you least expect them. Incorrectly calculating dates for time series data, for example, can lead to erroneous results.
How to employ cross-validation
Cross-validation is a popular technique you can use to evaluate and validate your model. The same principle of using separate datasets for testing and training applies here: The training data is used to build the model; the model is run against the testing set to predict data it hasn’t seen before, which is one way to evaluate its accuracy.
In cross-validation, the historical data is split into X subsets. Each time one subset is chosen as the test data, the remaining subsets are used as training data. Then, on the next run, the former test set becomes one of the training sets and one of the former training sets becomes the test set.
The process continues until each of the X subsets has been used as a test set.
For example, imagine you have a dataset that you have divided into 5 sets numbered 1 to 5. In the first run, you use set 1 as the test set and sets 2, 3, 4, and 5 as the training set. Then, on the second run, you use set 2 as the test set and sets 1, 3, 4, and 5 as the training set.
You continue this process until every subset of the 5 sets has been used as a test set.
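Here's a sketch of that 5-fold scheme in Python with scikit-learn's KFold. The synthetic dataset and the choice of logistic regression are assumptions made just to keep the example self-contained.

```python
# Sketch: 5-fold cross-validation, one fold as test data per run.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=8, random_state=7)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    # On each run, one subset is held out as the test set and the
    # remaining four subsets are used as the training set.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    score = accuracy_score(y[test_idx], model.predict(X[test_idx]))
    scores.append(score)
    print(f"Fold {fold}: accuracy {score:.3f}")

print("Average accuracy across the 5 folds:", np.mean(scores))
```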
Cross-validation allows you to use every data point in your historical data for both training and testing. This technique is more effective than simply splitting your historical data into two sets, using the larger set for training and the other for testing, and leaving it at that.
When you cross-validate your data, you’re protecting yourself against randomly picking test data that’s too easy to predict — which would give you the false impression that your model is accurate. Or, if you happen to pick test data that’s too hard to predict, you might falsely conclude that your model isn’t performing as you had hoped.
Cross-validation is widely used not only to validate the accuracy of models but also to compare the performance of multiple models.
How to balance bias and variance
Bias and variance are two sources of error that can creep in as you're building your analytical model.
Bias is the result of building a model that oversimplifies the representation of the relationships among data points in the historical data used to build the model.
Variance is the result of building a model that is overly specific to the data used to build it.
Achieving a balance between bias and variance — by reducing the variance and tolerating some bias — can lead to a better predictive model. This trade-off usually leads to building less complex predictive models.
Many data-mining algorithms have been created to take into account this trade-off between bias and variance.
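One way to see the trade-off is to vary a model's complexity and watch its cross-validated error. The polynomial-regression setup below is an illustrative assumption, not a method prescribed here: a very low degree underfits (high bias), a very high degree overfits (high variance), and a middle degree usually scores best.

```python
# Sketch: how bias and variance show up as model complexity grows.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)

for degree in (1, 4, 15):
    # Degree 1 is too simple (bias); degree 15 is too flexible (variance);
    # a moderate degree tends to balance the two.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: cross-validated MSE {mse:.3f}")
```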
How to troubleshoot ideas
When you’re testing your model and you find yourself going nowhere, here are a few ideas to consider that may help you get back on track:
Always double-check your work. You may have overlooked something you assumed was correct but isn’t. Such flaws could show up (for example) among the values of a predictive variable in your dataset, or in the preprocessing you applied to the data.
If the algorithm you chose isn't yielding any results, try another algorithm. For example, you might try several of the available classification algorithms; depending on your data and the business objectives of your model, one of them may perform better than the others (see the sketch after this list).
Try selecting different variables or creating new derived variables. Always be on the lookout for variables that have predictive power.
Consult frequently with business domain experts, who can help you make sense of the data, select variables, and interpret the model's results.
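Here's one way to compare a few candidate algorithms with cross-validation, as mentioned above. The specific algorithms and the synthetic dataset are assumptions chosen only to make the sketch runnable.

```python
# Sketch: comparing several classification algorithms by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=12, random_state=1)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(random_state=1),
}

for name, model in candidates.items():
    # Average accuracy across 5 cross-validation folds for each candidate.
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {score:.3f}")
```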