One way to avoid becoming a statistic is to approach your AI journey using an industry-proven model — the machine learning development life cycle. This figure shows the seven elements of the methodology. This methodology is based on the cross-industry standard process for data mining (CRISP-DM), a widely used open standard process model that describes common approaches used by data mining experts.
The table shows the questions that must be answered for each element.
| Element | Question |
|---|---|
| Define the task | What problem or question do you want to address with data? |
| Collect the data | What data do you have that could answer your questions? |
| Prepare the data | What do you need to do to prepare the data for mining? |
| Build the model | How can you mimic or enhance the human's knowledge or actions through technology? |
| Test and evaluate the model | What new information do you know now? |
| Deploy and integrate the model | What actions should you trigger with the new information? What needs human validation? |
| Maintain the model | How has the data changed over time? Do the results reflect current reality? |
Because of this feedback-and-iterate practice, the methodology is more a life cycle than a linear process: data drives the work, not a hunch, a policy, a committee, or some immutable principle. You start with a hypothesis or a burning question, such as “What do all our loyal customers have in common?” or flip it to ask “What do all our cancellations have in common?” Then you gather the required data, train a model on historical data, run current data through it to answer the question, and act on the answer. The steering group provides input along the way, but the data reflects the actual, not the hypothetical.
This principle of data-driven discovery and action is an important part of the life cycle because it ensures that the process is defensible and auditable. It keeps the project from going off the rails or down a rabbit hole.
Using the life cycle, you will always be able to answer questions such as how and why you created a particular model, how you will assess its accuracy and effectiveness, how you will use it in a production environment, and how it will evolve over time. You will also be able to identify model drift and determine whether changes to the model based on incoming data are pointing you toward new insights or diverting you toward undesired changes in scope.
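If you want a concrete starting point for spotting drift, here is a minimal sketch that compares a feature's training distribution against recent production data with a two-sample Kolmogorov-Smirnov test. The column name, file names, and threshold are illustrative assumptions, not prescriptions from the methodology.

```python
# Minimal drift-check sketch: compare a feature's training distribution
# against recent production data. Column name and threshold are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def feature_drifted(train: pd.Series, recent: pd.Series,
                    p_threshold: float = 0.01) -> bool:
    """Return True if the two samples look like different distributions."""
    statistic, p_value = ks_2samp(train.dropna(), recent.dropna())
    return p_value < p_threshold

# Hypothetical usage: 'monthly_spend' is an example feature name.
# train_df = pd.read_csv("training_snapshot.csv")
# recent_df = pd.read_csv("last_30_days.csv")
# if feature_drifted(train_df["monthly_spend"], recent_df["monthly_spend"]):
#     print("Distribution shift detected; review the model before trusting new scores.")
```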
Of the seven steps in the methodology, the first three take up the most time. You may recall that cleaning and organizing data can consume up to 60 percent of a data scientist's time. There's a good reason for that: bad data can cost up to 25 percent of revenue. However, all that time spent preparing the data is wasted if you don't really know what you want out of the data.
Define the task
What problem or question do you want to address with data? Follow these steps:
- Determine your business objectives.
- Assess the situation.
- Determine your data mining goals.
- Produce a project plan.
Some people think of AI as a magic machine where you pour data into the hopper, turn the crank, and brilliance comes out the other end. The reality is that a data science project is the process of actually building the machine, not turning the crank. And before you build a machine, you must have a very clear picture of what you want the machine to do.
Even though the process is data-driven, you don't start with data. You start with questions. You may have a wealth of pristine data neatly organized in databases, but if you don't know what you're trying to do, the stuff that comes out the other end when you turn the crank might be interesting, but it won't be actionable. That's why you start with questions. If you ask the right questions, you will know what kind of data you need. And if you get the right data, at the end you will get the answers — and likely more questions as well.
During the business understanding step, you establish a picture of what success will look like by determining the criteria for success. This step starts with a question. In the course of determining what you need to answer the question, you explore the terminology, assumptions, requirements, constraints, risks, contingencies, costs, and benefits related to the question and assemble an inventory of available resources.
For example, your initial question might be “What is causing an increase in customer churn?” This question could be expanded to ask “Can you pinpoint specific sources of friction in the customer journey that are leading to churn?” Pursuing that question may lead you to brainstorming and research, such as documenting the touchpoints in the current customer journey, analyzing the revenue impact of churn, and listing suspected candidates for friction.
Collect the data
What data do you have that could answer your questions? Follow these steps:
- Collect initial data.
- Describe the data.
- Explore the data.
- Verify data quality.
Remember that moment in The Princess Bride when Westley, Inigo, and Fezzik list their assets and liabilities before storming the castle and determine that they will need a wheelbarrow and that a holocaust cloak would come in handy? That was data understanding.
During the data understanding step, you establish the type of data you need, how you will acquire it, and how much data you need. You may source your data internally, from second parties such as solution partners, or from third-party providers.
For example, if you are considering a solution for predictive maintenance on a train, you might pull information from Internet of Things (IoT) sensors, weather patterns, and passenger travel patterns. To make sure you have the data required to answer your questions, you must first ask questions. What data do you have now? Are you using all the data you have? Maybe you’re collecting lots of data, but you use only three out of ten fields.
This step takes time, but it is an essential exercise that will increase the likelihood that you can trust the results and that you aren’t misled by the outcomes.
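As a small illustration of the describe-explore-verify steps, here is a pandas sketch that profiles a data set and flags mostly empty fields. The toy frame and column names are stand-ins for whatever extract you actually pull.

```python
# Data-understanding sketch: describe the data and verify its quality.
# The toy frame and column names stand in for your real extract.
import pandas as pd

df = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4],
    "monthly_spend": [42.0, None, 18.5, 77.0],
    "signup_source": ["web", "web", None, "retail"],
    "unused_field":  [None, None, None, None],
})

print(df.shape)                        # how many rows and columns you have
print(df.dtypes)                       # what type each field claims to be
print(df.describe(include="all").T)    # basic profile of every column

# Verify data quality: share of missing values per column.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0.2])   # flags mostly empty columns
```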
Prepare the data
What do you need to do to prepare the data for mining? Follow these steps:
- Select the data.
- Clean the data.
- Construct the data.
- Integrate the data.
- Format the data.
Clean the data: The available data for your project may have issues, such as missing or invalid values or inconsistent formatting. Cleaning the data involves establishing a uniform notation to express each value and setting default values or using a modeling technique to estimate suitable values for empty fields.
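Here is a minimal cleaning sketch along those lines; the column names and values are invented for illustration, and a real project might estimate missing values with a modeling technique rather than a simple median.

```python
# Cleaning sketch: uniform notation plus defaults for empty or invalid fields.
# The column names and values are illustrative, not from the text.
import pandas as pd

df = pd.DataFrame({
    "order_date":   ["2024-01-05", "not recorded", "2024-01-09"],
    "discount_pct": [10.0, None, 5.0],
    "unit_price":   [20.0, 22.0, None],
})

# Uniform notation: parse date strings into one canonical type;
# unparseable entries become missing values (NaT) instead of bad strings.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Default value: treat a missing discount as no discount.
df["discount_pct"] = df["discount_pct"].fillna(0)

# Simple estimate: fill a missing price with the column median
# instead of dropping the row.
df["unit_price"] = df["unit_price"].fillna(df["unit_price"].median())
print(df)
```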
Construct the data: In some cases, you might need a field that can be calculated or inferred from other fields in the data. For example, if you are doing analysis by sales region, detailed order records may not include the region, but that information can be derived from the address. You might even need to create new records to indicate the absence of an activity, such as creating a record with a value of zero to indicate the lack of sales for a product in a region.
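Here is a small sketch of both constructions: deriving a region field from location data and adding explicit zero-sales records for combinations with no activity. The mapping and column names are assumptions.

```python
# Construction sketch: derive a field and create explicit "no activity" records.
# The state-to-region mapping and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "state":    ["CA", "NY", "CA"],
    "product":  ["widget", "widget", "gadget"],
    "sales":    [120.0, 80.0, 60.0],
})

# Derive 'region' from a location field (here, just the state).
state_to_region = {"CA": "West", "NY": "East"}   # assumed mapping
orders["region"] = orders["state"].map(state_to_region)

# Create zero-value records for product/region pairs with no sales at all.
sales = orders.groupby(["region", "product"], as_index=False)["sales"].sum()
full_index = pd.MultiIndex.from_product(
    [sales["region"].unique(), sales["product"].unique()],
    names=["region", "product"],
)
sales = (sales.set_index(["region", "product"])
              .reindex(full_index, fill_value=0)
              .reset_index())
print(sales)   # East/gadget now appears with sales == 0
```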
Integrate the data: You might encounter a situation where you need to combine information from different data sources that store the data in different ways. For example, suppose you are analyzing store sales by region; if you don’t have a table for store-level data, you need to aggregate the order information for each store from individual orders to create store-level data. Or you may need to merge data from multiple tables. For example, in the store sales by region analysis, you may combine regional information such as manager and sales team from one source with store information from another source into one table.
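A minimal sketch of both integration cases, with invented table and column names: aggregate order rows up to store level, then merge in regional attributes from a second source.

```python
# Integration sketch: aggregate orders to store level, then merge in
# regional attributes from another source. All names are assumptions.
import pandas as pd

orders = pd.DataFrame({
    "store_id": [101, 101, 102],
    "amount":   [25.0, 40.0, 15.0],
})
regions = pd.DataFrame({
    "store_id": [101, 102],
    "region":   ["West", "East"],
    "manager":  ["Ortiz", "Chen"],
})

# Aggregate: one row per store, summing individual orders.
store_sales = orders.groupby("store_id", as_index=False)["amount"].sum()

# Merge: attach region and manager from the second source.
store_sales = store_sales.merge(regions, on="store_id", how="left")
print(store_sales)
```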
Format the data: The data you need might be trapped in an image, such as a presentation or graphic, in which case you would have to extract it through some method, such as optical character recognition, and then store the information as structured data.
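If the data really is locked in images, one commonly used open-source route is the Tesseract engine through the pytesseract wrapper. The sketch below assumes Tesseract, pytesseract, and Pillow are installed, and the file name is a placeholder.

```python
# OCR sketch: pull text out of an image and store it in structured form.
# Assumes the Tesseract engine plus the pytesseract and Pillow packages;
# the file name is a placeholder.
import pandas as pd
import pytesseract
from PIL import Image

raw_text = pytesseract.image_to_string(Image.open("quarterly_chart.png"))

# Keep the extraction alongside its source so later parsing is auditable.
extracted = pd.DataFrame(
    [{"source_file": "quarterly_chart.png", "raw_text": raw_text}]
)
print(extracted.head())
```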
Build the model
How can you mimic or enhance the human’s knowledge or actions through technology? Follow these steps:
- Select an algorithm and modeling techniques.
- Test the fit.
- Build the model.
- Assess the model.
By now, the modeling technique to use should be an obvious choice based on the questions you developed at the beginning and the data you have to work with.
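To make the build-and-assess steps concrete, here is a minimal scikit-learn sketch: it holds out a test set, fits a baseline classifier, and checks accuracy. The synthetic data is a stand-in; in the churn framing, the features might be tenure, monthly spend, and support tickets, but those names are assumptions, not part of the methodology.

```python
# Model-building sketch: pick a technique, hold out test data, fit, and assess.
# Synthetic data stands in for real customer history.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Hold out data the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```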
After you have trained the model with the training data set, test its accuracy with the test data set. One way of evaluating test results for a classification model is to use a confusion matrix, a simple two-by-two grid that compares predicted answers with actual outcomes.
For a simple example, consider a binary classifier that produces a yes or no answer. There are two ways of getting it right (correctly predicting yes or no) and two ways of getting it wrong (incorrectly predicting yes or no). In this case, imagine a recommendation engine offering a yes or no prediction for a specific customer regarding 100 items, compared with the customer’s actual responses. This table shows a set of possible results.

| Iterations = 100 | AI predicted: No | AI predicted: Yes |
|---|---|---|
| Customer actual: No | 35 | 10 |
| Customer actual: Yes | 5 | 50 |
Broken out into the standard confusion-matrix categories, those same 100 predictions look like this:

| Prediction | Actual | Category | Percent |
|---|---|---|---|
| Yes | Yes | True positive | 50 |
| No | No | True negative | 35 |
| Yes | No | False positive | 10 |
| No | Yes | False negative | 5 |
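If you want to compute that quadrant rather than tally it by hand, scikit-learn's confusion_matrix returns the same four counts. The arrays below simply reproduce the table's numbers; in practice they would come from your test data set.

```python
# Confusion-matrix sketch that reproduces the counts in the table above.
# In a real project, y_actual and y_predicted come from the test data set.
import numpy as np
from sklearn.metrics import confusion_matrix

# 100 customer responses arranged to match the table:
# 35 true negatives, 10 false positives, 5 false negatives, 50 true positives.
y_actual    = np.array([0] * 45 + [1] * 55)                      # 0 = no, 1 = yes
y_predicted = np.array([0] * 35 + [1] * 10 + [0] * 5 + [1] * 50)

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"true negatives={tn}  false positives={fp}  "
      f"false negatives={fn}  true positives={tp}")
```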
Test and evaluate the model
What new information do you know now? Follow these steps:
- Evaluate the results.
- Review the process.
- Determine the next steps.
Deploy and integrate the model
What actions should you trigger with the new information? What needs human validation? Follow these steps:
- Plan the deployment.
- Plan monitoring and maintenance.
- Produce the final report and presentation.
- Review the project.
Maintain the model
Because data has a shelf life, no data science project can run under a set-and-forget philosophy. In the maintenance stage, you must regularly retrain your model on fresh data so its answers reflect current reality.

The final report can be as simple as a summary of the life of the project and its outcomes, or it can be an exhaustive analysis of the results, their implications, and your plans for implementing the insights.
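As a sketch of what "regularly retrain on fresh data" can look like in practice, here is a minimal retraining check. The helper name, feature list, label column, and model file are assumptions carried over from the earlier sketches, not prescriptions.

```python
# Maintenance sketch: retrain on a fresh window of data and promote the new
# model only if it does at least as well as the current one on recent holdout
# data. The feature list, label column, and model file name are assumptions.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

FEATURES = ["tenure_months", "monthly_spend", "support_tickets"]  # assumed

def retrain_if_better(fresh_data: pd.DataFrame,
                      model_path: str = "churn_model.joblib") -> None:
    X_train, X_test, y_train, y_test = train_test_split(
        fresh_data[FEATURES], fresh_data["churned"],
        test_size=0.3, random_state=42,
    )
    candidate = RandomForestClassifier(n_estimators=200, random_state=42)
    candidate.fit(X_train, y_train)

    current = joblib.load(model_path)           # model deployed earlier
    if (accuracy_score(y_test, candidate.predict(X_test))
            >= accuracy_score(y_test, current.predict(X_test))):
        joblib.dump(candidate, model_path)      # promote the retrained model
```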
It’s always a good idea to hold a lessons-learned session after any significant effort, particularly if you plan to keep using the model. This meeting can cover the rabbit trails you followed and insights into best practices.