- Identify your data sources. Data could be in different formats or reside in various locations.
- Identify how you will access that data. Sometimes you may need to acquire third-party data, or data owned by a different division in your organization.
- Consider which variables to include in your analysis.
One standard approach is to start with a wide range of variables and eliminate those that offer no predictive value for the model.
- Determine whether to use derived variables. In many cases, a derived variable (such as the price-to-earnings ratio used to analyze stock prices) has a greater direct impact on the model than the raw variable does.
- Explore the quality of your data, seeking to understand both its state and limitations.
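The derived-variable point above can be sketched in pandas. The tickers, prices, and earnings below are invented purely for illustration:

```python
import pandas as pd

# Hypothetical stock data; tickers, prices, and earnings are invented.
stocks = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "price": [120.0, 45.0, 300.0],
    "earnings_per_share": [8.0, 1.5, 12.0],
})

# Derive the price-to-earnings ratio from the raw columns; the derived
# variable often carries more signal for a model than price alone.
stocks["pe_ratio"] = stocks["price"] / stocks["earnings_per_share"]
print(stocks[["ticker", "pe_ratio"]])
```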
The accuracy of the model's predictions is directly related to the variables you select and the quality of your data. At this point, you'll want to answer some data-specific questions:
- Is the data complete?
- Does it have any outliers?
- Does the data need cleansing?
- Do you need to fill in missing values, keep them as they are, or eliminate them altogether?
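These questions can be answered programmatically. The pandas sketch below runs them against an invented toy column, flagging outliers with the common interquartile-range heuristic and filling missing values with the median (one of the three options listed above):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one suspicious extreme.
df = pd.DataFrame({"age": [25, 31, 28, np.nan, 27, 95]})

# Is the data complete?
missing = int(df["age"].isna().sum())

# Does it have any outliers? One common heuristic: flag values more
# than 1.5 * IQR outside the middle half of the distribution.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df.loc[(df["age"] < q1 - 1.5 * iqr) |
                  (df["age"] > q3 + 1.5 * iqr), "age"]

# Filling missing values with the median is one option; keeping or
# eliminating them are the alternatives mentioned above.
df["age_filled"] = df["age"].fillna(df["age"].median())
print(missing, list(outliers))
```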
- Regression algorithms can be used to analyze time-series data.
- Classification algorithms can be used to analyze discrete data.
- Association algorithms can be used for data with correlated attributes.
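The first two families can be sketched with scikit-learn; the estimators and tiny datasets below are illustrative choices, not recommendations, and association-rule mining typically relies on a separate library (such as mlxtend's apriori), so it is not shown:

```python
from sklearn.linear_model import LinearRegression   # regression family
from sklearn.tree import DecisionTreeClassifier     # classification family

# Regression: predict a continuous value from a time index (toy series).
X_time = [[1], [2], [3], [4]]
y_cont = [10.0, 12.0, 14.0, 16.0]
reg = LinearRegression().fit(X_time, y_cont)

# Classification: predict a discrete label (toy features and labels).
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = ["no", "yes", "yes", "no"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(reg.predict([[5]])[0], clf.predict([[1, 1]])[0])
```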
Gathering relevant data (preferably many records over a long time period), preprocessing it, and extracting the features with the most predictive value is where you will spend the majority of your time. But you still have to choose your algorithm wisely: it should be suited to the business problem.
Data preparation is specific to the project you're working on and the algorithm you choose to employ. Depending on the project’s requirements, you will prepare your data accordingly and feed it to the algorithm as you build your model to address the business needs.
The dataset used to train and test the model must contain relevant business information to answer the problem you're trying to solve. If your goal is (for example) to determine which customer is likely to churn, then the dataset you choose must contain information about customers who have churned in the past in addition to customers who have not.
Some models created to mine data and make sense of its underlying relationships — for example, those built with clustering algorithms — needn't have a particular end result in mind.
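For instance, a clustering sketch with scikit-learn's KMeans needs no target variable at all; it simply groups the unlabeled points (the coordinates below are invented to form two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points in two visibly separate groups (illustrative data).
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Clustering mines structure without any predefined end result:
# there is no "y" here, only the data itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(labels)
```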
Underfitting
Underfitting occurs when your model can't detect any relationships in your data. This is usually an indication that essential variables (those with predictive power) weren't included in your analysis. If the variables used in your model don't have high predictive power, try adding new domain-specific variables and re-running your model. The end goal is to improve the performance of the model on the training data.
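A toy illustration of fixing underfitting by adding a derived, domain-specific variable: a straight line can't fit a quadratic target, but giving the same linear model x squared as an extra feature lets it fit the training data almost perfectly (the data is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on x**2, but the model
# initially sees only the raw variable x.
x = np.arange(-5, 6, dtype=float).reshape(-1, 1)
y = (x ** 2).ravel()

underfit = LinearRegression().fit(x, y)
score_raw = underfit.score(x, y)            # poor fit: model underfits

# Add a derived variable (x squared) and refit.
X_derived = np.hstack([x, x ** 2])
better = LinearRegression().fit(X_derived, y)
score_derived = better.score(X_derived, y)  # near-perfect training fit

print(round(score_raw, 3), round(score_derived, 3))
```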
Another issue to watch for is seasonality: if your data has a seasonal pattern and you fail to analyze multiple seasons, you may get into trouble. For example, a stock analysis that includes only data from a bull market (where overall stock prices are rising) doesn't account for crises or bubbles that can bring major corrections to the overall performance of stocks. Failing to include data that spans both bull and bear markets (when overall stock prices are falling) keeps the model from producing the best possible portfolio selection.
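One cheap guard against this: before modeling, check that the sample actually contains both rising and falling periods. The monthly prices below are invented to include one regime of each:

```python
import pandas as pd

# Invented monthly closing prices: a rising stretch, then a falling one.
prices = pd.Series([100, 105, 112, 120, 118, 110, 98, 90])

# A sample drawn entirely from one regime biases the model toward
# that regime; verify both kinds of moves are represented.
moves = prices.diff().dropna()
has_bull = bool((moves > 0).any())
has_bear = bool((moves < 0).any())
print(has_bull, has_bear)
```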