Gathering and Cleaning Data for Machine Learning

Updated: 2016-10-06 14:02:37
TensorFlow For Dummies
Although machines learn from data, no magic recipe exists in the world of algorithms: as the “no free lunch” theorem states, no single algorithm works best for every problem. Even sophisticated learning functions hit the wall and underperform when you don’t support them with the following:
  • Large enough quantities of data that are suitable for the algorithm you use
  • Clean, well-prepared data suitable for use in machine learning
Data quantity matters because of the trade-off between bias and variance. As a reminder, large quantities of data help most when the variance of the estimates is the problem, that is, when the specific data used for learning heavily influences predictions (the overfitting problem). More data helps because a larger number of examples lets a machine learning algorithm disambiguate the role of each signal it picks up from the data and uses in modeling the prediction.
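
As a rough illustration of how more examples reduce the variability of estimates, the following sketch fits a one-feature least-squares slope on many random samples and measures how much the fitted slope changes from sample to sample. The ground truth (y = 2x plus noise), the sample sizes, and the helper names `fit_slope` and `slope_spread` are all assumptions made up for this example.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Ordinary least-squares slope for a single feature."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def slope_spread(n_examples, n_repeats=200, seed=0):
    """Standard deviation of the fitted slope across repeated random samples."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_repeats):
        xs = [rng.uniform(0.0, 1.0) for _ in range(n_examples)]
        # Hypothetical ground truth: y = 2x plus Gaussian noise.
        ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return statistics.stdev(slopes)

spread_small = slope_spread(20)
spread_large = slope_spread(2000)
print(f"slope spread with 20 examples:   {spread_small:.3f}")
print(f"slope spread with 2000 examples: {spread_large:.3f}")
```

With the larger sample, the fitted slope depends far less on which examples happened to be drawn, which is the sense in which more data eases the overfitting problem.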

Besides data quantity, the need for data cleanliness is understandable: it’s just like the quality of teaching you get at school. If your teachers teach you only nonsense, give erroneous examples, spend time joking, and in other ways don’t take teaching seriously, you won’t do well on your examinations no matter how smart you are. The same is true for both simple and complex algorithms: if you feed them garbage data, they just produce nonsense predictions.

According to the principle of garbage in, garbage out (GIGO for short), bad data can truly harm machine learning. Bad data consists of missing data, outliers, skewed value distributions, redundant information, and features that are poorly explained.
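
A few of these problems can be spotted with simple checks. The sketch below is only one possible approach: it counts missing entries, flags outliers with Tukey's 1.5 × IQR rule, and uses the gap between mean and median as a rough hint of skew. The `incomes` values and the function names are invented for illustration, and redundancy and poorly explained features are not covered.

```python
import statistics

# Hypothetical toy column; None marks a missing entry.
incomes = [31.0, 29.5, None, 33.2, 30.1, 250.0, 28.7, None, 32.4, 30.9]

def count_missing(values):
    """Number of entries with no recorded value."""
    return sum(1 for v in values if v is None)

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]

def mean_median_gap(values):
    """Mean minus median; a large positive gap hints at a right-skewed column."""
    present = [v for v in values if v is not None]
    return statistics.fmean(present) - statistics.median(present)

print("missing entries:", count_missing(incomes))
print("outliers:", iqr_outliers(incomes))
print("mean - median:", round(mean_median_gap(incomes), 2))
```

Flagging a value is not the same as deciding it is wrong; whether to drop, cap, or keep an outlier still depends on what the feature means.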

Bad data may not be bad in the sense that it’s wrong. Quite often, bad data is just data that doesn’t comply with the standards you set for your data: a label written in many different ways; erratic values spilled over from other data fields; dates written in invalid formats; and unstructured text that you should have structured into a categorical variable.
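
Standards like these can be checked in code. The following sketch, under assumed standards, maps several spellings of a label to one canonical form and accepts dates only in a known format. The label mapping, the list of accepted date formats, and the function names are hypothetical; real data would need its own rules.

```python
from datetime import datetime

# Hypothetical mapping from observed spellings to one canonical label.
CANONICAL_LABELS = {
    "us": "United States",
    "usa": "United States",
    "u.s.a.": "United States",
    "united states": "United States",
}

# Date formats this sketch assumes are valid in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def normalize_label(raw):
    """Return the canonical label, or None when the spelling is unknown."""
    return CANONICAL_LABELS.get(raw.strip().lower())

def parse_date(raw):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None

for label in [" U.S.A. ", "usa", "Untied States"]:
    print(repr(label), "->", normalize_label(label))

for text in ["2016-10-06", "06/10/2016", "2016-13-45"]:
    print(repr(text), "->", parse_date(text))
```

Returning `None` for anything unrecognized, instead of guessing, keeps nonconforming values visible so that you can review them.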

Enforcing data validity rules in your databases, designing better data tables, and making the process that stores data more exact can prove invaluable for machine learning. Doing so lets you concentrate on solving trickier data problems.
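
One way to enforce such rules at storage time is with database constraints. The sketch below uses Python's built-in `sqlite3` module with `NOT NULL` and `CHECK` constraints; the `customers` table, its columns, and the allowed values are assumptions chosen for the example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: each CHECK constraint encodes one validity rule.
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        country TEXT NOT NULL CHECK (country IN ('US', 'IT', 'DE')),
        age INTEGER CHECK (age BETWEEN 0 AND 120),
        signup_date TEXT NOT NULL
            CHECK (signup_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
    )
""")

def try_insert(row):
    """Insert one row; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customers (country, age, signup_date) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False

rows = [
    ("US", 34, "2016-10-06"),   # valid
    ("usa", 34, "2016-10-06"),  # label not in the allowed set
    ("IT", 250, "2016-10-06"),  # age out of range
    ("DE", 41, "06/10/2016"),   # date not in YYYY-MM-DD form
]
results = [try_insert(row) for row in rows]
stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(results, stored)
```

The `GLOB` pattern checks only the shape of the date, not whether the month and day are real, so a stricter rule would still be needed for full date validation.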

About This Article

About the book author:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.