- Data in quantities large enough for the algorithm you use
- Clean, well-prepared data suitable for use in machine learning
Besides quantity, the need for clean data is easy to understand: it's just like the quality of the teaching you get at school. If your teachers teach you only nonsense, give erroneous examples, spend their time joking, and otherwise don't take teaching seriously, you won't do well on your examinations, no matter how smart you are. The same holds for both simple and complex algorithms: if you feed them garbage data, they just produce nonsense predictions.
According to the principle of garbage in, garbage out (GIGO for short), bad data can truly harm machine learning. Bad data takes the form of missing values, outliers, skewed value distributions, redundant information, and features that aren't clearly defined.
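As a minimal sketch, here's how you might screen for each of these problems with pandas before training; the table, column names, and values are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset exhibiting the problems described above.
df = pd.DataFrame({
    "age":        [23, 25, np.nan, 24, 260],   # a missing value and an outlier
    "income":     [30_000, 32_000, 31_000, 29_000, 30_500],
    "income_usd": [30_000, 32_000, 31_000, 29_000, 30_500],  # redundant copy
})

print(df.isna().sum())        # count missing values per feature
print(df["age"].skew())       # strong skew hints at a distorted distribution

# Flag outliers with the interquartile-range rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df.loc[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr), "age"])

# Perfectly correlated columns usually signal redundant information.
print(df.corr(numeric_only=True))
```

None of these checks fixes anything by itself; they just surface the rows and features you need to inspect before the algorithm ever sees them.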
Bad data may not be bad in the sense that it's wrong. Quite often, bad data is simply data that doesn't comply with the standards you've set for it: a label written in several different ways, stray values that spill over from other data fields, dates in invalid formats, and unstructured text that you should have structured into a categorical variable.
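As a sketch of what fixing such standards violations can look like, the following uses pandas on invented column names and values; the canonical label set and the keyword rule are assumptions for the example, not a general recipe:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA", "U.S.A.", "United States", "Canada"],  # one label, many spellings
    "signup":  ["2023-04-01", "04/01/2023", "not a date", "2023-04-02"],
    "comment": ["very satisfied", "unhappy", "satisfied", "very unhappy"],
})

# Map the variant spellings onto a single canonical label.
canonical = {"usa": "US", "u.s.a.": "US", "united states": "US"}
df["country"] = df["country"].str.lower().map(canonical).fillna(df["country"])

# Parse dates leniently (format="mixed" needs pandas 2.0+);
# invalid formats become NaT, so you can spot and handle them.
df["signup"] = pd.to_datetime(df["signup"], format="mixed", errors="coerce")

# Turn free text into a categorical variable by simple keyword matching.
df["sentiment"] = (
    df["comment"].str.contains("unhappy")
      .map({True: "negative", False: "positive"})
      .astype("category")
)
print(df)
```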
Enforcing data-validity rules in your databases, improving the design of your data tables, and tightening the process that stores the data can be of invaluable help for machine learning, freeing you to concentrate on solving trickier data problems.
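What enforcing validity rules can look like in practice is sketched below in plain Python; the rules and field names are hypothetical stand-ins for whatever constraints your own tables require (in a real database, many of them would be expressed as schema constraints instead):

```python
from datetime import date

VALID_COUNTRIES = {"US", "CA", "UK"}  # assumed canonical label set

def validate_record(record: dict) -> list[str]:
    """Return the list of rule violations; an empty list means the record is valid."""
    errors = []
    if not 0 <= record.get("age", -1) <= 120:
        errors.append("age out of range")
    if record.get("country") not in VALID_COUNTRIES:
        errors.append("country not in the canonical label set")
    if not isinstance(record.get("signup"), date):
        errors.append("signup is not a proper date")
    return errors

# Reject bad records before they ever reach the table.
bad = {"age": 260, "country": "U.S.A.", "signup": "04/01/2023"}
print(validate_record(bad))  # all three rules are violated
```

Rejecting (or at least flagging) records at write time is far cheaper than reverse-engineering what went wrong months later, when the bad values are already mixed into your training data.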