Articles From Dr. Anasse Bari
Article / Updated 11-29-2016
As much as you may not like it, your predictive analytics job is not over when your model goes live. Successful deployment of the model in production is no time to relax. You'll need to closely monitor its accuracy and performance over time. Models tend to degrade over time (some faster than others), and a new infusion of energy is required from time to time to keep them up and running. To stay successful, a model must be revisited and re-evaluated in light of new data and changing circumstances. If conditions change so they no longer fit the model's original training, then you'll have to retrain the model to meet the new conditions. Such demanding new conditions include

- An overall change in the business objective
- The adoption of — and migration to — new and more powerful technology
- The emergence of new trends in the marketplace
- Evidence that the competition is catching up

Your strategic plan should include staying alert for any such emergent need to refresh your model and take it to the next level, but updating your model should be an ongoing process anyway. You'll keep tweaking inputs and outputs, incorporating new data streams, retraining the model for the new conditions, and continuously refining its outputs. Keep these goals in mind:

- Stay on top of changing conditions by retraining and testing the model regularly; enhance it whenever necessary.
- Monitor your model's accuracy to catch any degradation in its performance over time.
- Automate the monitoring of your model by developing customized applications that report and track the model's performance; a minimal sketch of such a check follows this list. Automating the monitoring, or having other team members involved, alleviates any concerns a data scientist may have over the model's performance and makes better use of everyone's time. Automated monitoring also saves time and helps you avoid errors in tracking the model's performance.
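To make the automation goal concrete, here is a minimal sketch of an accuracy check you could schedule to run on fresh labeled data. It isn't part of the original article: the trained model object, the load_latest_labeled_batch helper, the baseline accuracy, and the alert threshold are all placeholders you would swap for your own.

from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.90   # accuracy measured at deployment time (assumed)
ALERT_THRESHOLD = 0.05     # how much degradation you tolerate (assumed)

def check_model_health(model, load_latest_labeled_batch):
    """Score the newest labeled data and flag degradation."""
    X_new, y_new = load_latest_labeled_batch()   # hypothetical data-loading helper
    current_accuracy = accuracy_score(y_new, model.predict(X_new))
    degraded = (BASELINE_ACCURACY - current_accuracy) > ALERT_THRESHOLD
    if degraded:
        print(f"ALERT: accuracy dropped to {current_accuracy:.2%}; "
              "consider retraining the model.")
    return current_accuracy, degraded

In practice you would log these numbers over time and raise an alert (by email or on a dashboard) whenever the degraded flag comes back true.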
Article / Updated 11-29-2016
In order to ensure a successful deployment of the predictive model you're building, you'll need to think about deployment very early on. The business stakeholders should have a say in what the final model looks like. Thus, at the beginning of the project, be sure your team discusses the required accuracy of the intended model and how best to interpret its results. Data modelers should understand the business objectives the model is trying to achieve, and all team members should be familiar with the metrics against which the model will be judged. The idea is to make sure everyone is on the same page, working to achieve the same goals, and using the same metrics to evaluate the benefits of the model. Keep in mind that the model's operational environment will most likely be different from the development environment. The differences can be significant, from the hardware and software configurations, to the nature of the data, to the footprint of the model itself. The modelers have to know all the requirements needed for a successful deployment in production before they can build a model that will actually work on the production systems. Implementation constraints can become obstacles that come between the model and its deployment. Understanding the limitations of your model is also critical to ensuring its success. Pay particular attention to these typical limitations:

- The time the model takes to run
- The data the model needs: sources, types, and volume
- The platform on which the model resides

Ideally, the model has a higher chance of getting deployed when

- It uncovers patterns within the data that were previously unknown.
- It can be easily interpreted and explained to the business stakeholders.
- The newly uncovered patterns actually make sense businesswise and offer an operational advantage.
Article / Updated 11-29-2016
Predictive analytics begins with good data. More data doesn't necessarily mean better data. A successful predictive analytics project requires, first and foremost, relevant and accurate data.

Keeping it simple isn't stupid

If you're trying to address a complex business decision, you may have to develop equally complex models. Keep in mind, however, that an overly complex model may degrade the quality of those precious predictions you're after, making them more ambiguous. The simpler you keep your model, the more control you have over the quality of the model's outputs. Limiting the complexity of the model depends on knowing what variables to select before you even start building it — and that consideration leads right back to the people with domain knowledge. Your business experts are your best source for insights into which variables have a direct impact on the business problem you're trying to solve. You can also decide empirically which variables to include or exclude. Use those insights to ensure that your training dataset includes most (if not all) of the data that you expect to use to build the model.

Data preparation puts the good stuff in

To ensure high data quality as a factor in the success of the model you're building, data preparation and cleaning can be of enormous help. When you're examining your data, pay special attention to

- Data that was automatically collected (for example, from web forms)
- Data that didn't undergo thorough screening
- Data that wasn't collected via a controlled process
- Data that may have out-of-range values, data-entry errors, and/or incorrect values

A quick profiling pass over the raw data, as sketched below, can surface many of these issues before modeling begins.

Common mistakes that lead to the dreaded "garbage in, garbage out" scenario include these classic goofs:

- Including more data than necessary
- Building more complex models than necessary
- Selecting bad predictor variables or features in your analysis
- Using data that lacks sufficient quality and relevance
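As a rough illustration of that profiling pass, here is a minimal sketch using pandas. The file name customers.csv, the age column, and the valid age range are assumptions for illustration only, not part of the original article.

import pandas as pd

# Load the raw data (customers.csv and its columns are assumed for illustration).
df = pd.read_csv("customers.csv")

# Count missing values per column to spot fields that skipped screening.
print(df.isna().sum())

# Flag out-of-range values in a numeric field (the valid range is an assumption).
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"{len(out_of_range)} rows have an implausible age value")

# Flag exact duplicate rows, which often come from automated collection.
print(f"{df.duplicated().sum()} duplicate rows found")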
Article / Updated 11-29-2016
To assemble your predictive analytics team, you'll need to recruit business analysts, data scientists, and information technologists. Regardless of their particular areas of expertise, your team members should be curious, engaged, motivated, and excited to dig as deep as necessary to make the project — and the business — succeed.

Getting business expertise on board

Business analysts serve as your domain experts: They provide the business-based perspective on which problems to solve — and give valuable insight on all business-related questions. Their experience and domain knowledge give them an intuitive savvy about what approaches might or might not work, where to start, and what to look at to get something going. A model is only as relevant as the questions you use it to answer. Solid knowledge of your specific business can start you off in the right direction; use your experts' perspectives to determine:

- Which are the right questions? (Which aspects of your business do you want predictive analytics to improve?)
- Which is the right data to include in the analysis? (Should your focus be on the efficiency of your business processes? The demographics of your customers? Which body of data stands out as the most critical?)
- Who are the business stakeholders, and how can they benefit from the insights gained from your predictive analytics project?

Hiring analytical team members who understand your line of business will help you focus the building of your predictive analytics solutions on the desired business outcomes.

Firing up IT and math expertise

Data scientists play an important role in linking the worlds of business and data to the technology and the algorithms, while following well-established methodologies that are proven to be successful. They have a big say in developing the actual models, and their views will affect the outcome of your whole project. This role requires expertise in statistics, such as knowledge of regression analysis and cluster analysis. (Regression analysis is a statistical method that investigates the relationships between variables.) The role also requires the ability to choose the right technical solutions for the business problem and to articulate the business value of the outcome to the stakeholders. Your data scientists should possess knowledge of advanced algorithms and techniques such as machine learning, data mining, and natural language processing. Then you need IT experts to apply technical expertise to the implementation, monitoring, maintenance, and administration of the needed IT systems. Their job is to make sure the IT infrastructure and all IT strategic assets are stable, secure, and available to enable the business mission. An example of this is making sure the computer network and the database work smoothly together. When data scientists have selected the appropriate techniques, then (together with IT experts) they can oversee the overall design of the system's architecture and improve its performance in response to different environments and different volumes of data. In addition to the usual suspects — business experts, math and statistical modelers, and computer scientists — you may want to spice up your team with specialists from other disciplines such as physics, psychology, philosophy, or liberal arts to generate fresh ideas and new perspectives.
Article / Updated 11-29-2016
In perspective, the goal of designing an architecture for data analytics comes down to building a framework for capturing, sorting, and analyzing big data for the purpose of discovering actionable results. There is no one correct way to design the architectural environment for big data analytics. However, most designs need to meet the following requirements to support the challenges big data can bring. These criteria can be distributed mainly over six layers and can be summarized as follows:

- Your architecture should include a big data platform for storage and computation, such as Hadoop or Spark, which is capable of scaling out.
- Your architecture should include large-scale software and big data tools capable of analyzing, storing, and retrieving big data. These can consist of the components of Spark or the components of the Hadoop ecosystem (such as Mahout and Apache Storm). You might also want to adopt a large-scale big data tool that will be used by the data scientists in your business, such as Radoop from RapidMiner, IBM Watson, and many others.
- Your architecture should support virtualization. Virtualization is an essential element of cloud computing because it allows multiple operating systems and applications to run at the same time on the same server. Because of this capability, virtualization and cloud computing often go hand in hand.
- You might also adopt a private cloud in your architecture. A private cloud offers the same architecture as a public cloud, except the services in a private cloud are restricted to a certain number of users through a firewall. Amazon Elastic Compute Cloud is one of the major providers of cloud solutions and storage space for businesses, and it can scale as they grow.
- Your architecture might have to offer real-time analytics if your enterprise is working with fast data (data that is flowing in streams at a fast rate). In such a scenario, you would need to consider an infrastructure that can support the derivation of insights from data in near real time, without waiting for data to be written to disk. For example, Apache Spark's streaming library can be glued with other components to support analytics on fast data streams; a minimal sketch appears after the tool overview below.
- Your architecture should account for big data security by creating a system of governance around the supply of access to the data and the results. The big data security architecture should be in line with the standard security practices and policies in your organization that govern access to data sources.

If you're looking for a robust tool to help you get started on data analytics without the need for expertise in the algorithms and complexities behind building predictive models, then you should try KNIME, RapidMiner, or IBM Watson, among others. Most of the preceding tools offer a comprehensive, ready-to-use toolbox that consists of capabilities that can get you started. For example, RapidMiner has a large number of algorithms covering different stages of the predictive analytics lifecycle, so it provides a straightforward path to quickly combining and deploying analytics models. With RapidMiner, you can quickly load and prepare your data, create and evaluate predictive models, use data processes in your applications, and share them with your business users. With very few clicks, you can easily build a simple predictive analytics model. RapidMiner can be used by both beginners and experts.
RapidMiner Studio is open-source predictive analytics software with an easy-to-use graphical interface where you can drag and drop algorithms for data loading, data preprocessing, predictive analytics, and model evaluation to build your data analytics process. RapidMiner was built to provide data scientists with a comprehensive toolbox that consists of more than a thousand different operations and algorithms. The data can be loaded quickly, regardless of whether your data source is Excel, Access, MS SQL, MySQL, SPSS, Salesforce, or any other format supported by RapidMiner. In addition to data loading, predictive model building, and model evaluation, this tool also provides data visualization tools that include adjustable self-organizing maps and 3-D graphs. RapidMiner offers an open extension application programming interface (API) that allows you to integrate your own algorithms into any pipeline built in RapidMiner. It's also compatible with many platforms and can run on major operating systems. There is a growing online community of data scientists who use RapidMiner, where they share their processes and ask and answer questions.

Another easy-to-use tool that is widely used in the analytics world is KNIME, which stands for the Konstanz Information Miner. It's an open-source data analytics platform that can help you build predictive models through a data-pipelining concept. The tool offers drag-and-drop components for ETL (extraction, transformation, and loading), components for predictive modeling, and data visualization. KNIME and RapidMiner are tools with which you can arm your data science team to get started building predictive models easily. For an excellent use case on KNIME, check out the paper "The Seven Techniques for Dimensionality Reduction."

RapidMiner Radoop is a product by RapidMiner that extends the predictive analytics toolbox of RapidMiner Studio to run on Hadoop and Spark environments. Radoop encapsulates MapReduce, Pig, Mahout, and Spark. After you define your workflows in Radoop, the instructions are executed in the Hadoop or Spark environment, so you don't have to program predictive models yourself and can instead focus on model evaluation and the development of new models. For security, Radoop supports Kerberos authentication and integrates with Apache Ranger and Apache Sentry.
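Tying back to the real-time requirement above, here is a minimal, hypothetical sketch of scoring a fast data stream with Spark Structured Streaming (PySpark). The socket source on localhost:9999 and the simple threshold rule standing in for a trained predictive model are assumptions, not part of the original article.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Build a Spark session for the streaming job.
spark = SparkSession.builder.appName("FastDataScoringSketch").getOrCreate()

# Read a stream of newline-delimited numeric readings from a socket source
# (localhost:9999 is an assumption for illustration).
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Cast each line to a number and flag readings above a placeholder threshold;
# in practice this step would apply a trained predictive model instead.
scored = lines.select(col("value").cast("double").alias("reading")) \
              .withColumn("alert", col("reading") > 100.0)

# Print scored records to the console as they arrive, in near real time.
query = scored.writeStream.outputMode("append").format("console").start()
query.awaitTermination()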
Article / Updated 11-29-2016
One clustering algorithm offered in scikit-learn that can be used in predictive analytics is the mean shift algorithm. This algorithm, like DBSCAN, doesn't require you to specify the number of clusters, or any other parameters, when you create the model. The primary tuning parameter for this algorithm is the bandwidth parameter. You can think of bandwidth as the size of a round window that can encompass the data points in a cluster. Choosing a value for bandwidth isn't trivial, so go with the default.

Running the full dataset

The steps to create a new model with a different algorithm are essentially the same each time:

1. Open a new Python interactive shell session.
   Use a new Python session so that memory is clear and you have a clean slate to work with.
2. Load the Iris dataset by pasting the following code at the prompt and observing the output:
   >>> from sklearn.datasets import load_iris
   >>> iris = load_iris()
3. Create an instance of mean shift.
   Type the following code into the interpreter:
   >>> from sklearn.cluster import MeanShift
   >>> ms = MeanShift()
   Mean shift is created with the default value for bandwidth.
4. Check which parameters were used by typing the following code into the interpreter:
   >>> ms
   MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, min_bin_freq=1, n_jobs=1, seeds=None)
5. Fit the Iris data into the mean shift clustering algorithm by typing the following code into the interpreter:
   >>> ms.fit(iris.data)
6. To check the outcome, type the following code into the interpreter:
   >>> ms.labels_
   array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Mean shift produced two clusters (0 and 1).

Visualizing the clusters

A scatter plot is a good way to visualize the relationship between a large number of data points. It's useful for visually identifying clusters of data and finding data points that are distant from formed clusters. Let's produce a scatter plot of the mean shift output. Type the following code:

>>> import pylab as pl
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> pl.figure('Figure 13-7')
>>> for i in range(0, pca_2d.shape[0]):
...     if ms.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif ms.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
>>> pl.legend([c1, c2], ['Cluster 1', 'Cluster 2'])
>>> pl.title('Mean shift finds 2 clusters')
>>> pl.show()

The scatter plot output of this code is shown here. Mean shift found two clusters. You can try to tune the model with the bandwidth parameter to see if you can get a three-cluster solution; a sketch of one way to do that appears at the end of this article. Mean shift is very sensitive to the bandwidth parameter:

- If the chosen value is too big, the clusters will tend to combine and the final output will be a smaller number of clusters than desired.
- If the chosen value is too small, the algorithm may produce too many clusters and will take longer to run.

Evaluating the model

Mean shift didn't produce the ideal results with the default parameters for the Iris dataset, but a two-cluster solution is in line with other clustering algorithms.
Each project has to be examined individually to see how well the number of clusters fits the business problem. The obvious benefit of using mean shift is that you don't have to predetermine the number of clusters. In fact, you can use mean shift as a tool to find the number of clusters for creating a K-means model. Mean shift is often used for computer vision applications because it works well in lower dimensions and accommodates clusters of any shape and any size.
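If you want to experiment with bandwidth, as suggested above, here is a small sketch using scikit-learn's estimate_bandwidth helper. The quantile values tried here are arbitrary assumptions, and the number of clusters you get may differ from run to run and from version to version.

>>> from sklearn.cluster import MeanShift, estimate_bandwidth
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> # Smaller quantiles give smaller bandwidths and therefore more clusters.
>>> for q in (0.1, 0.2, 0.3):
...     bw = estimate_bandwidth(iris.data, quantile=q)
...     ms = MeanShift(bandwidth=bw).fit(iris.data)
...     print(q, round(bw, 2), len(set(ms.labels_)))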
Article / Updated 11-29-2016
The random forest model is an ensemble model that can be used in predictive analytics; it takes an ensemble (selection) of decision trees to create its model. The idea is to take a random sample of weak learners (each trained on a random subset of the training data) and have them vote to select the strongest and best model. The random forest model can be used for either classification or regression. In the following example, the random forest model is used to classify the Iris species.

Loading your data

This code listing loads the Iris dataset into your session:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

Creating an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the random forest library. The second line creates an instance of the random forest algorithm:

>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier(n_estimators=15, random_state=111)

The n_estimators parameter in the constructor is a commonly used tuning parameter for the random forest model. The value sets the number of trees built in the forest. It's generally between 10 and 100 percent of the number of rows in the dataset, but it depends on the data you're using. Here, the value is set at 15, which is 10 percent of the data. Later, you will see that changing the parameter value to 150 (100 percent) produces the same results. The n_estimators parameter is used to tune model performance and overfitting. The greater the value, the better the performance, but at the cost of overfitting. The smaller the value, the lower the chance of overfitting, but at the cost of lower performance. Also, there is a point beyond which increasing the number yields little improvement in accuracy and may dramatically increase the computational power needed. The parameter defaults to 10 if it is omitted from the constructor.

Running the training data

You'll need to split the dataset into training and test sets before you can train the random forest classifier. The following code will accomplish that task:

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> rf = rf.fit(X_train, y_train)

Line 1 imports the function that allows you to split the dataset into two parts. (Older versions of scikit-learn exposed it as sklearn.cross_validation.train_test_split.) Line 2 calls that function to split the dataset into two parts and assigns the now-divided datasets to two pairs of variables. Line 3 takes the instance of the random forest classifier you just created, then calls the fit method to train the model with the training dataset.

Running the test data

In the following code, the first line feeds the test dataset to the model, and the second line displays the output:

>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])

Evaluating the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that the model predicted two test data points incorrectly, so the accuracy of the random forest model was 86.67 percent.
Here's the code:

>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
0.8666666666666667  # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True, True, True, True, False, True, True, True, True, True, True, True, False, True, True], dtype=bool)

How does the random forest model perform if you change the n_estimators parameter to 150? It looks like it won't make a difference for this small dataset. It produces the same result:

>>> rf = RandomForestClassifier(n_estimators=150, random_state=111)
>>> rf = rf.fit(X_train, y_train)
>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
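As an optional follow-up that isn't covered in the original text, a fitted random forest also exposes feature_importances_, which can help you explain which Iris measurements drive the predictions. This sketch assumes the rf model and iris data from the session above:

>>> # Inspect how much each Iris measurement contributed to the forest's decisions.
>>> for name, score in zip(iris.feature_names, rf.feature_importances_):
...     print(name, round(score, 3))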
Article / Updated 11-29-2016
Big data has the potential to inspire businesses to make better decisions through predictive analytics. It's important to be aware of the tools that can quickly help you create good visualizations, because you want to keep your audience engaged and interested. Here are some popular visualization tools for large-scale enterprise analytics. Most of these tools don't require any coding experience, and they are easy to use. If your raw data is in Excel sheets or resides in databases, you can load your data into these tools to visualize it for data exploration and analytics purposes. Alternatively, you may have the results from applying a predictive model on your data ready in spreadsheets, so you can also use these tools to visualize those results.

Tableau

Tableau is a visualization tool for enterprise analytics. With Tableau, you can load your data and visualize it in charts, maps, tree maps, histograms, and word clouds. You can run Tableau as a desktop application, on a server, or as a cloud-based solution. Tableau integrates with many big data platforms, such as R, RapidMiner, and Hadoop. Tableau pulls data from major databases and supports many file formats. Tableau for enterprise isn't free; for academic purposes, Tableau can provide free licenses.

Google Charts

Google chart tools are free and easy to use. They include histograms, geo charts, column charts, scatter charts, timeline charts, and organizational charts. Google Charts are interactive, zoomable, and built on HTML5 and SVG. Google Charts can also visualize real-time data.

Plotly

Plotly is another visualization tool that your teams of developers can adopt using its APIs. You can create charts and dashboards with Plotly. Plotly is compatible with Python, R, and MATLAB, and its visualizations can be embedded in web-based applications. A minimal Python example appears below.

Infogram

Infogram helps you create visualizations in a three-step process: choosing a template, adding charts to visualize your data, and then sharing your visualizations. A monthly fee is required to use the tool in its professional, business, or enterprise versions. The tool can support multiple accounts for your team.
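To give a flavor of the Plotly option mentioned above, here is a minimal sketch using the plotly.express module and its bundled Iris sample data; it's only one of many ways to use the library, not part of the original article.

import plotly.express as px

# Load the small Iris sample dataset that ships with Plotly Express.
df = px.data.iris()

# Build an interactive scatter plot colored by species and open it in a browser.
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 title="Iris measurements by species")
fig.show()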
Article / Updated 11-29-2016
A visualization can represent a simulation (a pictorial representation of a what-if scenario) in predictive analytics. You can follow up a visualization of a prediction with a simulation that overlaps and supports the prediction. For example, what happens if the company stops manufacturing Product D? What happens if a natural disaster strikes the home office? What happens if your customers lose interest in a particular product? You can use visualization to simulate the future behavior of a company, a market, a weather system — you name it. A dashboard is another type of visualization you can use to display a comprehensive predictive analytics model. The dashboard allows you, using its controls, to change any step in the predictive analytics pipeline: selecting the data, preprocessing the data, selecting a predictive model, and selecting the right evaluation method. You can easily modify any part of the pipeline at any time using the dashboard's controls. A dashboard is an interactive type of visualization where you have control and can change the diagrams, tables, or maps dynamically, based on the inputs you choose to include in the analyses that generate those charts and graphs.

At least one predictive analytics technique is purely inspired by the natural phenomenon of birds flocking. The bird-flocking model not only identifies groupings in data, it shows them in dynamic action. The same technique can be used to picture hidden patterns in your data. The model represents data objects as birds flying in a virtual space, following flocking rules that orchestrate how a migrating swarm of birds moves in nature. Representing several data objects as birds reveals that similar data objects flock together to form subflocks (groupings). The similarity among objects in the real world is what drives the movements of the corresponding birds in the virtual space. For example, imagine that you want to analyze the online data collected from several Internet users (also known as netizens). Every piece of information (gleaned from such sources as social network user information and customer online transactions) is represented as a corresponding bird in the virtual space. If the model finds that two or more users interact with each other through email or chat, appear in the same online photo, buy the same product, or share the same interests, the model shows those netizens as birds that flock together, following natural flocking rules. The interaction (that is, how close the representative birds get to each other) is expressed as a mathematical function that depends on the frequency of social interaction, or the intensity with which the users buy the same products or share the same interests. This mathematical function depends purely on the type of analytics you're applying. The image above depicts the interaction on Facebook between Netizens X and Y in cyberspace as a bird-flocking virtual space, where both X and Y are represented as birds. Because Netizens X and Y have interacted with each other, the next flocking iteration will show their two birds as closer together. An algorithm known as "flock by leader," invented by Prof. Anasse Bari and Prof. Bellaachia (see the references at the end of this article), was inspired by a recent discovery that revealed the leadership dynamics in pigeons.
This algorithm can mine user input for data points that enable it to detect leaders, discover their followers, and initiate flocking behavior in virtual space that closely mimics what happens when flocks form naturally — except the flocks, in this case, are data clusters called data flocks. This technique not only detects patterns in data, but also provides a clear pictorial representation of the results obtained by applying predictive analytics models. The rules that orchestrate flocking behavior in nature were extended to create new flocking rules that conform to data analytics:

- Data flock homogeneity: Members of the flock show similarity in data.
- Data flock leadership: The model anticipates information leaders.

Representing a large dataset as a flock of birds is one way to easily visualize big data in a dashboard. This visualization model can be used to detect pieces of data that are outliers, leaders, or followers. One political application could be to visualize community outliers, community leaders, or community followers. In the biomedical field, the model can be used to visualize outlier genomes and leaders among genetic samples of a particular disease (say, those that show a particular mutation most consistently). A bird-flocking visualization can also be used to predict future patterns of unknown phenomena in cyberspace — civil unrest, an emerging social movement, a future customer's lineage. The flocking visualization is especially useful if you're receiving a large volume of streamed data at high velocity: You can see flocks forming in the virtual space that contains the birds representing your data objects. The results of data analytics are reflected (literally) on the fly in the virtual space: Reality is given a fictional, yet observable and analytically meaningful, representation inspired purely by nature. Such visualizations can also work well as simulations or what-if scenarios. A visualization based on flocking behavior starts by indexing each netizen to a virtual bird. Initially, all the birds are idle. As data comes in, each bird starts flocking in the virtual space according to the analytics results and the flocking rules. Below, the emerging flock forms as the analytics are presented. After analyzing data over a large period of time ending at t+k, the results of this application of predictive analytics can be depicted as shown below. The flock-by-leader algorithm differentiates the members of the flock into three classes: a leader, followers, and outliers. The flock-by-leader algorithm was invented by Dr. Bari and Dr. Bellaachia, and it is explained in detail in these resources:

- "Flock by Leader: A Novel Machine Learning Biologically-Inspired Clustering Algorithm," IEEE International Conference on Swarm Intelligence, 2012. Also published as a book chapter in Advances in Swarm Intelligence, 2012 edition (Springer-Verlag).
- "SFLOSCAN: A Biologically Inspired Data Mining Framework for Community Identification in Dynamic Social Networks," IEEE International Conference on Computational Intelligence (SSCI 2011), 2011.
Article / Updated 11-29-2016
Often, you need to be able to show the results of your predictive analytics to those who matter. Here are some ways to use visualization techniques to report the results of your models to the stakeholders.

Visualizing hidden groupings in your data

Data clustering is the process of discovering hidden groups of related items within your data. In most cases, a cluster (grouping) consists of data objects of the same type, such as social network users, text documents, or emails. One way to visualize the results of a data-clustering model is shown below, where the graph represents social communities (clusters) that were discovered in data collected from social network users. The data about customers was collected in a tabular format; then a clustering algorithm was applied to the data, and three clusters (groups) were discovered: loyal customers, wandering customers, and discount customers. Assume that the X and Y axes represent the two principal components generated from the original data. (Principal component analysis, or PCA, is a data reduction technique.) Here the visual relationship among the three groups already suggests where enhanced and targeted marketing efforts might do the most good.

Visualizing data classification results

A classification model assigns a specific class to each new data point it examines. The specific classes, in this case, could be the groups that result from your clustering work. The output highlighted in the graph can define your target sets. For any given new customer, a predictive classification model attempts to predict which group the new customer will belong to. After you've applied a clustering algorithm and discovered groupings in the customer data, you come to a moment of truth: Here comes a new customer — you want the model to predict which type of customer he or she will be. The image shows how a new customer's information is fed to your predictive analytics model, which in turn predicts which group of customers this new customer belongs to. New Customers A, B, and C are about to be assigned to clusters according to the classification model. Applying the classification model resulted in a prediction that Customer A would belong with the loyal customers, Customer B would be a wanderer, and Customer C was only showing up for the discount.

Visualizing outliers in your data

In the course of clustering or classifying new customers, every now and then you run into outliers (special cases that don't fit the existing divisions). Below, you see a few outliers that don't fit well into the predefined clusters. Six outlier customers have been detected and visualized. They behave differently enough that the model can't tell whether they belong to any of the defined categories of customers.

Visualizing decision trees

Many models use decision trees as their outputs: These diagrams show the possible results from alternative courses of action, laid out like the branches of a tree. The image below shows an example of a tree used as a classifier: It classifies baseball fans based on a few criteria, mainly the amount spent on tickets and the purchase dates. From this visualization, you can predict the type of fan that a new ticket-buyer will be: casual, loyal, bandwagon, diehard, or some other type. Attributes of each fan are mentioned at each level in the tree (total number of attended games, total amount spent, season); you can follow a path from the "root" to a specific "leaf" on the tree, where you hit one of the fan classes (c1, c2, c3, c4, c5).
Suppose you want to determine what type of baseball fan a customer is so that you can decide what type of marketing ads to send to the customer. Suppose you hypothesize that baseball fanatics and bandwagon fans can be persuaded to buy a new car when their team is doing well and headed for the playoffs. You may want to send marketing ads and discounts to persuade them to make the purchase. Further, suppose you hypothesize that bandwagon fans can be persuaded to vote in support of certain political issues. You can send them marketing ads asking them for that support. If you know what type of fan base you have, using decision trees can help you decide how to approach it as a range of customer types.

Visualizing predictions

Assume you've run an array of predictive analytics models, including decision trees, random forests, and flocking algorithms. You can combine all those results and present a consistent narrative that they all support. Here, confidence is a numerical percentage that can be calculated using a mathematical function; the result of the calculation encapsulates a score of how probable a possible occurrence is. On the x axis, the supporting evidence represents the content sources that were analyzed with content-analytics models to identify the possible outcomes. In most cases, your predictive model would have processed a large dataset, using data from various sources, to derive those possible outcomes. Thus you need to show only the most important supporting evidence in your visualization. Above, a summary of the results obtained from applying predictive analytics is presented as a visualization that illustrates possible outcomes, along with a confidence score and supporting evidence for each one. Three possible scenarios are shown:

- The inventory of Item A will not keep up with demand if you don't ship at least 100 units weekly to Store S. (Confidence score: 98 percent.)
- The number of sales will increase by 40 percent if you increase the production of Item A by at least 56 percent. (Confidence score: 83 percent.)
- A marketing campaign in California will increase sales of Items A and D but not Item K. (Confidence score: 72 percent.)

The confidence score represents the likelihood that each scenario will happen, according to your predictive analytics model. Note that they are listed here in descending order of likelihood. Here, the most important supporting evidence consists of excerpts from several content sources presented along the x axis. You can refer to them if you need to explain how you got to a particular possible scenario — and trot out the evidence that supports it. The power behind this visualization is its simplicity. Imagine, after months of applying predictive analytics to your data, working your way through several iterations, that you walk into a meeting with the decision maker. You're armed with a one-slide visualization of three possible scenarios that might have a huge impact on the business. Such a visualization creates effective discussions and can lead management to "aha" moments.
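To connect the clustering and PCA ideas in this article to working code, here is a minimal, hypothetical scikit-learn sketch. The synthetic customer data, the choice of three clusters, and K-means standing in for whatever clustering algorithm you actually use are all assumptions, not part of the original article.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Generate synthetic "customer" data with four numeric attributes
# (a stand-in for real customer records).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)

# Discover three groupings, mirroring the loyal/wandering/discount example.
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Reduce the data to two principal components so the clusters can be plotted.
pca_2d = PCA(n_components=2).fit_transform(X)

# Scatter plot of the clusters in the reduced two-dimensional space.
plt.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Customer clusters visualized with PCA")
plt.show()

The colored groupings in the resulting plot play the same role as the loyal, wandering, and discount clusters described above.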