As with many aspects of any business system, data is a human creation — so it’s apt to have some limits on its usability when you first obtain it. Here’s an overview of some limitations you’re likely to encounter:
The data could be incomplete. Missing values, or even the absence of an entire section or a substantial portion of the data, can limit its usability.
For example, your data might cover only one or two conditions of a larger set that you’re trying to model, as when a model built to analyze stock market performance has data from only the past five years, which skews both the data and the model toward the assumption of a bull market.
The moment the market undergoes a correction that leads to a bear market, the model fails to adapt, simply because it wasn’t trained and tested on data that represents one.
Make sure you’re looking at a timeframe that gives you a complete picture of the natural fluctuations of your data; your data shouldn’t be limited by seasonality.
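A quick way to guard against that trap is to check how much calendar time your data actually spans before you model it. Here’s a minimal sketch using pandas; the stock_prices.csv file and its date column are assumptions for illustration:

```python
import pandas as pd

# Hypothetical daily stock-price data; the file name and "date"
# column are assumptions for illustration.
prices = pd.read_csv("stock_prices.csv", parse_dates=["date"])

start, end = prices["date"].min(), prices["date"].max()
span_years = (end - start).days / 365.25
print(f"Data covers {span_years:.1f} years "
      f"({start:%Y-%m-%d} to {end:%Y-%m-%d})")

# A short window may capture only one market regime.
if span_years < 10:
    print("Warning: this window may not include both bull and bear markets.")
```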
If you’re using data from surveys, keep in mind that people don’t always provide accurate information. Not everyone will answer truthfully about (say) how many times they exercise — or how many alcoholic beverages they consume — per week. People may not be dishonest so much as self-conscious, but the data is still skewed.
Data collected from different sources can vary in quality and format. Surveys, e-mails, data-entry forms, and the company website all produce data with different attributes and structures, and the fields from one source may not be very compatible with those from another. Such data requires major preprocessing before it’s analysis-ready. The accompanying sidebar provides an example.
Data collected from multiple sources may have differences in formatting, duplicate records, and inconsistencies across merged data fields. Expect to spend a long time cleaning such data — and even longer validating its reliability.
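To give a sense of what that cleaning involves, here’s a minimal pandas sketch that merges two hypothetical sources, harmonizes one shared field, and removes the duplicates the merge introduces; the file and column names are assumptions for illustration:

```python
import pandas as pd

# Two hypothetical sources of customer records; the file and
# column names are assumptions for illustration.
survey = pd.read_csv("survey_responses.csv")
web = pd.read_csv("website_signups.csv")

# Harmonize formatting before merging: normalize e-mail case and whitespace.
for df in (survey, web):
    df["email"] = df["email"].str.strip().str.lower()

merged = pd.concat([survey, web], ignore_index=True)

# Report and drop the duplicate records the merge introduces.
n_dupes = merged.duplicated(subset="email").sum()
print(f"Dropping {n_dupes} duplicate records")
merged = merged.drop_duplicates(subset="email", keep="first")
```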
To determine the limitations of your data, be sure to:
Verify all the variables you’ll use in your model.
Assess the scope of the data, especially over time, so your model can avoid the seasonality trap.
Check for missing values, identify where they occur, and assess their impact on the overall analysis.
Watch out for extreme values (outliers) and decide whether to include them in the analysis. (The first sketch after this list shows both checks.)
Confirm that the pool of training and test data is large enough.
Make sure each data type (integer, decimal value, character, and so forth) is correct, and set the upper and lower bounds of possible values. (The second sketch after this list shows one way to do it.)
Pay extra attention to data integration when your data comes from multiple sources.
Be sure you understand your data sources and their impact on the overall quality of your data.
Choose a relevant dataset that is representative of the whole population.
Choose the right parameters for your analysis.
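For the missing-value and outlier checks, a minimal pandas sketch might look like the following; the file name, the purchase_amount column, and the 1.5 × IQR rule of thumb are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset; the file and column names are assumptions.
df = pd.read_csv("customer_data.csv")

# Missing values: count them per column to gauge their impact.
missing = df.isna().sum()
print(missing[missing > 0])

# Outliers: flag values beyond 1.5 * IQR on a numeric column
# (a common rule of thumb, not the only valid choice).
q1, q3 = df["purchase_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["purchase_amount"] < q1 - 1.5 * iqr) |
              (df["purchase_amount"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers found")
```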
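And for the data-type and bounds check, another short sketch, again with a hypothetical file, column, and bounds:

```python
import pandas as pd

df = pd.read_csv("customer_data.csv")  # hypothetical file, as above

# Confirm that each column's data type matches what you expect.
print(df.dtypes)

# Enforce a numeric type, then sanity-check upper and lower bounds;
# the "age" column and the 0-120 bounds are assumptions for illustration.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"{len(out_of_range)} rows fall outside the expected age range")
```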
Even after all this care and attention, don’t be surprised if your data still needs preprocessing before you can analyze it accurately. Preprocessing often takes a long time and significant effort because it has to address several issues in the original data, including the following (a combined sketch appears after the list):
Any values missing from the data.
Any inconsistencies and/or errors existing in the data.
Any duplicates or outliers in the data.
Any normalization or other transformation of the data.
Any derived data needed for the analysis.
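To make those steps concrete, here’s a minimal preprocessing sketch in pandas that touches most of the items above; the file, the columns, and the specific choices (median fill, min-max scaling) are assumptions for illustration:

```python
import pandas as pd

df = pd.read_csv("customer_data.csv")  # hypothetical file, as above

# Missing values: fill numeric gaps with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# Duplicates: drop exact duplicate rows.
df = df.drop_duplicates()

# Normalization: rescale a numeric column to the [0, 1] range (min-max).
col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())

# Derived data: compute a new field needed for the analysis;
# the "household_size" column is an assumption for illustration.
df["income_per_member"] = df["income"] / df["household_size"]
```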