
Training, Validating, and Testing in Machine Learning

In a perfect world, you would test your machine learning algorithm on data it has never seen during training. However, waiting for fresh data isn't always feasible in terms of time and cost.

As a first simple remedy, you can randomly split your data into training and test sets. The common split reserves 25 to 30 percent of the data for testing and uses the remaining 70 to 75 percent for training. You split the response and the features at the same time, keeping the correspondence between each response and its features.
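Here's a minimal sketch of such a split using scikit-learn's train_test_split function (the X and y arrays are hypothetical stand-ins for your features and response; the 30 percent test size is one of the common choices mentioned above):

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 examples, 4 features each, plus a response vector.
X = np.random.rand(100, 4)
y = np.random.rand(100)

# Splitting X and y in the same call keeps each response
# paired with its features; 30 percent goes to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (70, 4) (30, 4)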

The second remedy comes into play when you need to tune your learning algorithm. In this case, reusing the test split isn't good practice, because it causes another kind of overfitting called snooping. To overcome snooping, you need a third split, called a validation set. A suggested split partitions your examples into three sets: 70 percent for training, 20 percent for validation, and 10 percent for testing.
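One way to obtain all three sets, as a rough sketch, is to apply train_test_split twice, reusing the hypothetical X and y arrays from the previous example:

from sklearn.model_selection import train_test_split

# First cut: hold out 10 percent of the examples as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)

# Second cut: take the validation set from the remaining 90 percent.
# 20 percent of the original data is 2/9 of that remainder.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2/9, random_state=42)

# The result is roughly a 70/20/10 partition of the original data.

Because the test set is carved out first, tuning decisions made against the validation set never touch the test examples.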

You should perform the split randomly, that is, regardless of the initial ordering of the data. Otherwise, your test won't be reliable, because ordering could cause overestimation (when the ordering is meaningful) or underestimation (when the distributions differ too much). As a solution, you must ensure that the test set distribution isn't very different from the training distribution, and that no sequential ordering survives in the split data.

For example, check whether identification numbers, when available, run in continuous sequences in your sets; continuous runs are a sign that the split wasn't random. Sometimes, even if you strictly abide by random sampling, you can't obtain similar distributions among the sets, especially when the number of examples is small.

When your number of examples n is high, such as n > 10,000, you can quite confidently create a randomly split dataset. When the dataset is smaller, comparing basic statistics such as the mean, mode, median, and variance of the response and the features across the training and test sets helps you understand whether the test set is unsuitable. When you aren't sure that the split is right, just compute a new one.
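As a rough sanity check on a smaller dataset, you might print those summary statistics side by side; this sketch assumes the y_train and y_test arrays produced by the earlier splits:

import numpy as np

# Compare basic statistics of the response in the two sets.
# Large gaps suggest the split isn't representative.
for name, values in (("train", y_train), ("test", y_test)):
    print(f"{name}: mean={values.mean():.3f}  "
          f"median={np.median(values):.3f}  var={values.var():.3f}")

You would repeat the same comparison for each feature column before trusting the split.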

About This Article

This article is from the book TensorFlow For Dummies.

About the book authors:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist who specializes in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Through his work as a quantitative marketing consultant and marketing researcher, he has worked with quantitative data since 2000 for different clients in various industries, and he is one of the top 10 Kaggle data scientists.