Tommy Jung

Anasse Bari, Ph.D., is a data science expert and a university professor with many years of experience in predictive modeling and data analytics. Mohamed Chaouchi is a veteran software engineer who has conducted extensive research using data mining methods. Tommy Jung is a software engineer with expertise in enterprise web applications and analytics.

Articles From Tommy Jung

Predictive Analytics For Dummies Cheat Sheet

Cheat Sheet / Updated 04-27-2022

A predictive analytics project combines execution of details with big-picture thinking. These handy tips and checklists will help keep your project on the rails and out of the woods.

How to Create a Supervised Learning Model with Logistic Regression

Article / Updated 04-26-2017

After you build your first classification predictive model for analysis of the data, creating more models like it is a straightforward task in scikit-learn. The only real difference from one model to the next is that you may have to tune the parameters from algorithm to algorithm.

How to load your data

This code listing will load the iris dataset into your session:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

How to create an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the logistic regression library. The second line creates an instance of the logistic regression algorithm.

>>> from sklearn import linear_model
>>> logClassifier = linear_model.LogisticRegression(C=1, random_state=111)

Notice the C parameter (the regularization parameter) in the constructor. The regularization parameter is used to prevent overfitting. The parameter isn't strictly necessary (the constructor will work fine without it because it will default to C=1). Creating a logistic regression classifier using C=150 creates a better plot of the decision surface. You can see both plots below.

How to run the training data

You'll need to split the dataset into training and test sets before you can create an instance of the logistic regression classifier. The following code will accomplish that task:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> logClassifier.fit(X_train, y_train)

Line 1 imports the library that allows you to split the dataset into two parts. Line 2 calls the function from the library that splits the dataset into two parts and assigns the now-divided datasets to two pairs of variables. Line 3 takes the instance of the logistic regression classifier you just created and calls the fit method to train the model with the training dataset.

How to visualize the classifier

Looking at the decision surface area on the plot, it looks like some tuning has to be done. If you look near the middle of the plot, you can see that many of the data points belonging to the middle area (Versicolor) are lying in the area to the right side (Virginica). This image shows the decision surface with a C value of 150. It visually looks better, so choosing this setting for your logistic regression model seems appropriate.

How to run the test data

In the following code, the first line feeds the test dataset to the model and the second line displays the output:

>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])

How to evaluate the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that it predicted all the test data points correctly. Here's the code:

>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
1.0   # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True], dtype=bool)

So how does the logistic regression model with parameter C=150 compare to that? Well, you can't beat 100 percent. Here is the code to create and evaluate the logistic classifier with C=150:

>>> logClassifier_2 = linear_model.LogisticRegression(C=150, random_state=111)
>>> logClassifier_2.fit(X_train, y_train)
>>> predicted = logClassifier_2.predict(X_test)
>>> metrics.accuracy_score(y_test, predicted)
0.93333333333333335
>>> metrics.confusion_matrix(y_test, predicted)
array([[5, 0, 0],
       [0, 2, 0],
       [0, 1, 7]])

We expected better, but it was actually worse: there was one error in the predictions. The result is the same as that of the Support Vector Machine (SVM) model.

Here is the full listing of the code to create and evaluate a logistic regression classification model with the default parameters:

>>> from sklearn.datasets import load_iris
>>> from sklearn import linear_model
>>> from sklearn import cross_validation
>>> from sklearn import metrics
>>> iris = load_iris()
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> logClassifier = linear_model.LogisticRegression(C=1, random_state=111)
>>> logClassifier.fit(X_train, y_train)
>>> predicted = logClassifier.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
1.0   # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True], dtype=bool)
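If you're running a recent release of scikit-learn, note that the cross_validation module shown above has since been replaced by model_selection. Here is a minimal sketch of the same workflow with the current API; the accuracy you get may differ slightly from the listings above because newer releases use a different default solver.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.10, random_state=111)

# C is the inverse regularization strength, as in the listings above
logClassifier = LogisticRegression(C=1, random_state=111)
logClassifier.fit(X_train, y_train)
predicted = logClassifier.predict(X_test)
print(accuracy_score(y_test, predicted))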

How to Explain the Results of an R Classification Predictive Analytics Model

Article / Updated 03-24-2017

Another task in predictive analytics is to classify new data by predicting what class a target item of data belongs to, given a set of independent variables. You can, for example, classify a customer by type – say, as a high-value customer, a regular customer, or a customer who is ready to switch to a competitor – by using a decision tree.

To see some useful information about the R classification model, type in the following code:

> summary(model)
    Length      Class  Mode
         1 BinaryTree    S4

The Class column tells you that you've created a decision tree. To see how the splits are being determined, you can simply type in the name of the variable to which you assigned the model, in this case model, like this:

> model

  Conditional inference tree with 6 terminal nodes

Response:  seedType
Inputs:  area, perimeter, compactness, length, width, asymmetry, length2
Number of observations:  147

1) area <= 16.2; criterion = 1, statistic = 123.423
  2) area <= 13.37; criterion = 1, statistic = 63.549
    3) length2 <= 4.914; criterion = 1, statistic = 22.251
      4)* weights = 11
    3) length2 > 4.914
      5)* weights = 45
  2) area > 13.37
    6) length2 <= 5.396; criterion = 1, statistic = 16.31
      7)* weights = 33
    6) length2 > 5.396
      8)* weights = 8
1) area > 16.2
  9) length2 <= 5.877; criterion = 0.979, statistic = 8.764
    10)* weights = 10
  9) length2 > 5.877
    11)* weights = 40

Even better, you can visualize the model by creating a plot of the decision tree with this code:

> plot(model)

This is a graphical representation of a decision tree. You can see that the overall shape mimics that of a real tree. It's made of nodes (the circles and rectangles) and links or edges (the connecting lines). The very first node (starting at the top) is called the root node, and the nodes at the bottom of the tree (rectangles) are called terminal nodes. There are five decision nodes and six terminal nodes.

At each node, the model makes a decision based on the criteria in the circle and the links, and chooses a way to go. When the model reaches a terminal node, a verdict or a final decision is reached. In this particular case, two attributes, area and length2, are used to decide whether a given seed type is in class 1, 2, or 3.

For example, take observation #2 from the dataset. It has a length2 of 4.956 and an area of 14.88. You can use the tree you just built to decide which particular seed type this observation belongs to. Here's the sequence of steps:

1. Start at the root node, which is node 1 (the number is shown in the small square at the top of the circle).
2. Decide based on the area attribute: Is the area of observation #2 less than or equal to (denoted by <=) 16.2? The answer is yes, so move along the path to node 2.
3. At node 2, the model asks: Is the area <= 13.37? The answer is no, so try the next link, which asks: Is the area > 13.37? The answer is yes, so move along the path to node 6.
4. At this node the model asks: Is the length2 <= 5.396? It is, so you move to terminal node 7, and the verdict is that observation #2 is of seed type 1.

And it is, in fact, seed type 1. The model does that process for all other observations to predict their classes.

To find out whether you trained a good model, check it against the training data. You can view the results in a table with the following code:

> table(predict(model), trainSet$seedType)

     1  2  3
  1 45  4  3
  2  3 47  0
  3  1  0 44

The results show that the error (or misclassification rate) is 11 out of 147, or 7.48 percent. With the results calculated, the next step is to read the table.

The correct predictions are the ones where the column and row numbers are the same. Those results show up as a diagonal line from top left to bottom right; for example, cells [1,1], [2,2], and [3,3] hold the number of correct predictions for each class. So for seed type 1, the model correctly predicted it 45 times, while misclassifying the seed 7 times (4 times as seed type 2, and 3 times as type 3). For seed type 2, the model correctly predicted it 47 times, while misclassifying it 3 times. For seed type 3, the model correctly predicted it 44 times, while misclassifying it only once. This shows that this is a good model.

So now you evaluate it with the test data. Here is the code that uses the test data to make predictions and store them in a variable (testPrediction) for later use:

> testPrediction <- predict(model, newdata=testSet)

To evaluate how the model performed with the test data, view it in a table and calculate the error. The code looks like this:

> table(testPrediction, testSet$seedType)

testPrediction  1  2  3
             1 23  2  1
             2  1 19  0
             3  1  0 17

The results show that the error is 5 out of 64, or 7.81 percent. This is consistent with the training data.
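If your team works in Python rather than R, the same train/predict/confusion-table workflow can be sketched with scikit-learn's DecisionTreeClassifier. This is only a minimal illustration that uses the Iris data as a stand-in for the seeds dataset above; the split size and random seed are arbitrary assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=111)

model = DecisionTreeClassifier(random_state=111).fit(X_train, y_train)
predicted = model.predict(X_test)

cm = confusion_matrix(y_test, predicted)   # rows are actual classes, columns are predicted
print(cm)
errors = cm.sum() - cm.trace()             # off-diagonal cells are misclassifications
print("misclassification rate:", errors / cm.sum())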

Predictive Analytics: Knowing When to Update Your Model

Article / Updated 11-29-2016

As much as you may not like it, your predictive analytics job is not over when your model goes live. Successful deployment of the model in production is no time to relax. You'll need to closely monitor its accuracy and performance over time. A model tends to degrade over time (some faster than others), and a new infusion of energy is required from time to time to keep that model up and running. To stay successful, a model must be revisited and re-evaluated in light of new data and changing circumstances.

If conditions change so they no longer fit the model's original training, then you'll have to retrain the model to meet the new conditions. Such demanding new conditions include

- An overall change in the business objective
- The adoption of — and migration to — new and more powerful technology
- The emergence of new trends in the marketplace
- Evidence that the competition is catching up

Your strategic plan should include staying alert for any such emergent need to refresh your model and take it to the next level, but updating your model should be an ongoing process anyway. You'll keep on tweaking inputs and outputs, incorporating new data streams, retraining the model for the new conditions, and continuously refining its outputs. Keep these goals in mind:

- Stay on top of changing conditions by retraining and testing the model regularly; enhance it whenever necessary.
- Monitor your model's accuracy to catch any degradation in its performance over time.
- Automate the monitoring of your model by developing customized applications that report and track the model's performance (see the sketch below). Automation of monitoring, or having other team members involved, alleviates any concerns a data scientist may have over the model's performance and can improve the use of everyone's time. Automated monitoring saves time and helps you avoid errors in tracking the model's performance.
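As a concrete illustration of the "automate the monitoring" goal, here is a minimal sketch of a scheduled accuracy check. The model, the fresh labeled batch, the threshold value, and the retraining hook are all illustrative assumptions, not prescriptions from this article.

from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90   # assumed floor; agree on the real number with your stakeholders

def model_still_healthy(model, X_new, y_new):
    # Score the model on a fresh batch of labeled data and flag degradation
    accuracy = accuracy_score(y_new, model.predict(X_new))
    print("current accuracy: %.3f" % accuracy)
    return accuracy >= ACCURACY_THRESHOLD

# Run this on a schedule; when it returns False, alert the team or kick off retraining:
# if not model_still_healthy(model, X_new, y_new):
#     trigger_retraining_job()   # hypothetical hook into your own infrastructure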

Tips for Building Deployable Models for Predictive Analytics

Article / Updated 11-29-2016

In order to ensure a successful deployment of the predictive model you're building, you'll need to think about deployment very early on. The business stakeholders should have a say in what the final model looks like. Thus, at the beginning of the project, be sure your team discusses the required accuracy of the intended model and how best to interpret its results.

Data modelers should understand the business objectives the model is trying to achieve, and all team members should be familiar with the metrics against which the model will be judged. The idea is to make sure everyone is on the same page, working to achieve the same goals, and using the same metrics to evaluate the benefits of the model.

Keep in mind that the model's operational environment will most likely be different from the development environment. The differences can be significant, from the hardware and software configurations, to the nature of the data, to the footprint of the model itself. The modelers have to know all the requirements needed for a successful deployment in production before they can build a model that will actually work on the production systems. Implementation constraints can become obstacles that come between the model and its deployment.

Understanding the limitations of your model is also critical to ensuring its success. Pay particular attention to these typical limitations:

- The time the model takes to run
- The data the model needs: sources, types, and volume
- The platform on which the model resides

Ideally, the model has a higher chance of getting deployed when:

- It uncovers patterns within the data that were previously unknown.
- It can be easily interpreted by the business stakeholders.
- The newly uncovered patterns actually make sense businesswise and offer an operational advantage.

Using Relevant Data for Predictive Analytics: Avoid “Garbage In, Garbage Out”

Article / Updated 11-29-2016

Predictive analytics begins with good data. More data doesn't necessarily mean better data. A successful predictive analytics project requires, first and foremost, relevant and accurate data.

Keeping it simple isn't stupid

If you're trying to address a complex business decision, you may have to develop equally complex models. Keep in mind, however, that an overly complex model may degrade the quality of those precious predictions you're after, making them more ambiguous. The simpler you keep your model, the more control you have over the quality of the model's outputs.

Limiting the complexity of the model depends on knowing what variables to select before you even start building it — and that consideration leads right back to the people with domain knowledge. Your business experts are your best source for insights into what variables have a direct impact on the business problem you're trying to solve. You can also decide empirically which variables to include or exclude. Use those insights to ensure that your training dataset includes most (if not all) of the data that you expect to use to build the model.

Data preparation puts the good stuff in

To ensure high data quality as a factor in the success of the model you're building, data preparation and cleaning can be of enormous help. When you're examining your data, pay special attention to the following (a short screening sketch appears after these lists):

- Data that was automatically collected (for example, from web forms)
- Data that didn't undergo thorough screening
- Data that wasn't collected via a controlled process
- Data that may have out-of-range values, data-entry errors, and/or incorrect values

Common mistakes that lead to the dreaded "garbage in, garbage out" scenario include these classic goofs:

- Including more data than necessary
- Building more complex models than necessary
- Selecting bad predictor variables or features in your analysis
- Using data that lacks sufficient quality and relevance
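To put the data-quality checklist above into practice, here is a minimal data-screening sketch with pandas. The file name, column name, and valid range are illustrative assumptions; substitute your own.

import pandas as pd

df = pd.read_csv("customers.csv")           # assumed input file
print(df.isna().sum())                      # missing values per column
print(df.duplicated().sum(), "duplicate rows")

# Flag out-of-range values for a column whose valid bounds you know
out_of_range = df[(df["age"] < 0) | (df["age"] > 120)]
print(len(out_of_range), "rows with implausible age values")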

How to Build a Predictive Analytics Team

Article / Updated 11-29-2016

To assemble your predictive analytics team, you'll need to recruit business analysts, data scientists, and information technologists. Regardless of their particular areas of expertise, your team members should be curious, engaged, motivated, and excited to dig as deep as necessary to make the project — and the business — succeed.

Getting business expertise on board

Business analysts serve as your domain experts: They provide the business-based perspective on which problems to solve and give valuable insight on all business-related questions. Their experience and domain knowledge give them an intuitive savvy about what approaches might or might not work, where to start, and what to look at to get something going. A model is only as relevant as the questions you use it to answer. Solid knowledge of your specific business can start you off in the right direction; use your experts' perspectives to determine:

- Which are the right questions? (Which aspects of your business do you want predictive analytics to improve?)
- Which is the right data to include in the analysis? (Should your focus be on the efficiency of your business processes? The demographics of your customers? Which body of data stands out as the most critical?)
- Who are the business stakeholders, and how can they benefit from the insights gained from your predictive analytics project?

Hiring analytical team members who understand your line of business will help you focus the building of your predictive analytics solutions on the desired business outcomes.

Firing up IT and math expertise

Data scientists play an important role in linking the worlds of business and data to the technology and algorithms, while following well-established methodologies that are proven to be successful. They have a big say in developing the actual models, and their views will affect the outcome of your whole project. This role requires expertise in statistics, such as knowledge of regression/non-regression analysis and cluster analysis. (Regression analysis is a statistical method that investigates the relationships between variables.) The role also requires the ability to choose the right technical solutions for the business problem and to articulate the business value of the outcome to the stakeholders. Your data scientists should possess knowledge of advanced algorithms and techniques such as machine learning, data mining, and natural language processing.

Then you need IT experts to apply technical expertise to the implementation, monitoring, maintenance, and administration of the needed IT systems. Their job is to make sure the IT infrastructure and all IT strategic assets are stable, secure, and available to enable the business mission. An example of this is making sure the computer network and database work smoothly together. When data scientists have selected the appropriate techniques, then (together with IT experts) they can oversee the overall design of the system's architecture and improve its performance in response to different environments and different volumes of data.

In addition to the usual suspects — business experts, math and statistical modelers, and computer scientists — you may want to spice up your team with specialists from other disciplines such as physics, psychology, philosophy, or liberal arts to generate fresh ideas and new perspectives.

Enterprise Architecture for Big Data

Article / Updated 11-29-2016

Put in perspective, the goal of designing an architecture for data analytics comes down to building a framework for capturing, sorting, and analyzing big data for the purpose of discovering actionable results. There is no one correct way to design the architectural environment for big data analytics. However, most designs need to meet the following requirements to support the challenges big data can bring. These criteria can be distributed mainly over six layers and can be summarized as follows:

- Your architecture should include a big data platform for storage and computation, such as Hadoop or Spark, that is capable of scaling out.
- Your architecture should include large-scale software and big data tools capable of analyzing, storing, and retrieving big data. These can consist of the components of Spark or the components of the Hadoop ecosystem (such as Mahout and Apache Storm). You might also want to adopt a large-scale big data tool that will be used by the data scientists in your business; these include Radoop from RapidMiner, IBM Watson, and many others.
- Your architecture should support virtualization. Virtualization is an essential element of cloud computing because it allows multiple operating systems and applications to run at the same time on the same server. Because of this capability, virtualization and cloud computing often go hand in hand.
- You might also adopt a private cloud in your architecture. A private cloud offers the same architecture as a public cloud, except the services in a private cloud are restricted to a certain number of users through a firewall. Amazon Elastic Compute Cloud is one of the major providers of cloud solutions and storage space for businesses, and it can scale as they grow.
- Your architecture might have to offer real-time analytics if your enterprise is working with fast data (data that is flowing in streams at a high rate). In such a scenario, you would need an infrastructure that can support the derivation of insights from data in near real time, without waiting for data to be written to disk. For example, Apache Spark's streaming library can be glued with other components to support analytics on fast data streams (see the sketch at the end of this article).
- Your architecture should account for big data security by creating a system of governance around the supply of access to the data and the results. The big data security architecture should be in line with the standard security practices and policies in your organization that govern access to data sources.

If you're looking for a robust tool to help you get started on data analytics without the need for expertise in the algorithms and complexities behind building predictive models, try KNIME, RapidMiner, or IBM Watson, among others. Most of these tools offer a comprehensive, ready-to-use toolbox of capabilities that can get you started. For example, RapidMiner has a large number of algorithms from different stages of the predictive analytics lifecycle, so it provides a straightforward path to quickly combining and deploying analytics models. With RapidMiner, you can quickly load and prepare your data, create and evaluate predictive models, use data processes in your applications, and share them with your business users. With very few clicks, you can easily build a simple predictive analytics model. RapidMiner can be used by both beginners and experts.

RapidMiner Studio is open-source predictive analytics software with an easy-to-use graphical interface where you can drag and drop algorithms for data loading, data preprocessing, predictive analytics, and model evaluation to build your data analytics process. RapidMiner was built to provide data scientists with a comprehensive toolbox that consists of more than a thousand different operations and algorithms. The data can be loaded quickly, regardless of whether your data source is Excel, Access, MS SQL, MySQL, SPSS, Salesforce, or any other format that RapidMiner supports. In addition to data loading, predictive model building, and model evaluation, the tool also provides data visualization tools that include adjustable self-organizing maps and 3-D graphs.

RapidMiner offers an open extension application programming interface (API) that allows you to integrate your own algorithms into any pipeline built in RapidMiner. It's also compatible with many platforms and can run on major operating systems. There is an emerging online community of data scientists who use RapidMiner, where they share their processes and ask and answer questions.

Another easy-to-use tool that is widely used in the analytics world is KNIME, which stands for the Konstanz Information Miner. It's an open-source data analytics platform that can help you build predictive models through a data-pipelining concept. The tool offers drag-and-drop components for ETL (Extraction, Transformation, and Loading) as well as components for predictive modeling and data visualization. KNIME and RapidMiner are tools you can arm your data science team with to get started quickly building predictive models. For an excellent use case on KNIME, check out the paper "The Seven Techniques for Dimensionality Reduction."

RapidMiner Radoop is a RapidMiner product that extends the predictive analytics toolbox of RapidMiner Studio to run on Hadoop and Spark environments. Radoop encapsulates MapReduce, Pig, Mahout, and Spark. After you define your workflows in Radoop, the instructions are executed in the Hadoop or Spark environment, so you don't have to program the distributed execution yourself and can focus on model evaluation and the development of new models. For security, Radoop supports Kerberos authentication and integrates with Apache Ranger and Apache Sentry.
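As a sketch of the real-time analytics layer mentioned above, here is a minimal PySpark Structured Streaming job that keeps a running count of event types arriving on a socket. The host, port, and one-column message format are illustrative assumptions, not a recommended production design.

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.appName("FastDataSketch").getOrCreate()

# Assume each incoming line looks like "event_type,value"
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
events = lines.select(split(col("value"), ",").getItem(0).alias("event_type"))

# Maintain a running count per event type and print it to the console
counts = events.groupBy("event_type").count()
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()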

How to Create an Unsupervised Learning Model with Mean Shift

Article / Updated 11-29-2016

One clustering algorithm offered in scikit-learn that can be used in predictive analytics is the mean shift algorithm. This algorithm, like DBSCAN, doesn't require you to specify the number of clusters, or any other parameters, when you create the model. The primary tuning parameter for this algorithm is called the bandwidth parameter. You can think of bandwidth as the size of a round window that can encompass the data points in a cluster. Choosing a value for bandwidth isn't trivial, so go with the default.

Running the full dataset

The steps to create a new model with a different algorithm are essentially the same each time:

1. Open a new Python interactive shell session. Use a new Python session so that memory is clear and you have a clean slate to work with.

2. Paste the following code in the prompt and observe the output:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

3. Create an instance of mean shift. Type the following code into the interpreter:

>>> from sklearn.cluster import MeanShift
>>> ms = MeanShift()

Mean shift is created with the default value for bandwidth.

4. Check which parameters were used by typing the following code into the interpreter:

>>> ms
MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True,
    min_bin_freq=1, n_jobs=1, seeds=None)

5. Fit the Iris data into the mean shift clustering algorithm by typing the following code into the interpreter:

>>> ms.fit(iris.data)

6. To check the outcome, type the following code into the interpreter:

>>> ms.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Mean shift produced two clusters (0 and 1).

Visualizing the clusters

A scatter plot is a good way to visualize the relationship between a large number of data points. It's useful for visually identifying clusters of data and finding data points that are distant from formed clusters. Let's produce a scatter plot of the mean shift output. Type the following code:

>>> import pylab as pl
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2).fit(iris.data)
>>> pca_2d = pca.transform(iris.data)
>>> pl.figure('Figure 13-7')
>>> for i in range(0, pca_2d.shape[0]):
...     if ms.labels_[i] == 1:
...         c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='+')
...     elif ms.labels_[i] == 0:
...         c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='o')
...
>>> pl.legend([c1, c2], ['Cluster 1', 'Cluster 2'])
>>> pl.title('Mean shift finds 2 clusters')
>>> pl.show()

The scatter plot output of this code is shown here. Mean shift found two clusters. You can try to tune the model with the bandwidth parameter to see if you can get a three-cluster solution (a short sketch at the end of this article shows one way to do that). Mean shift is very sensitive to the bandwidth parameter:

- If the chosen value is too big, then the clusters will tend to combine and the final output will be a smaller number of clusters than desired.
- If the chosen value is too small, then the algorithm may produce too many clusters and it will take longer to run.

Evaluating the model

Mean shift didn't produce the ideal results with the default parameters for the Iris dataset, but a two-cluster solution is in line with other clustering algorithms. Each project has to be examined individually to see how well the number of clusters fits the business problem. The obvious benefit of using mean shift is that you don't have to predetermine the number of clusters. In fact, you can use mean shift as a tool to find the number of clusters for creating a K-means model. Mean shift is often used for computer vision applications because it's good at lower dimensions, accommodates clusters of any shape, and accommodates clusters of any size.
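As suggested above, you can experiment with the bandwidth rather than accepting the default. Here is a minimal sketch that lets scikit-learn estimate a bandwidth from the data; the quantile value is an illustrative assumption and may or may not yield a three-cluster solution.

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import load_iris

iris = load_iris()
# estimate_bandwidth derives a bandwidth from pairwise distances in the data;
# smaller quantile values give a smaller window and usually more clusters
bandwidth = estimate_bandwidth(iris.data, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth)
ms.fit(iris.data)
print(len(set(ms.labels_)), "clusters found")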

How to Create a Supervised Learning Model with Random Forest for Predictive Analytics

Article / Updated 11-29-2016

The random forest model is an ensemble model that can be used in predictive analytics; it takes an ensemble (selection) of decision trees to create its model. The idea is to take a random sample of weak learners (a random subset of the training data) and have them vote to select the strongest and best model. The random forest model can be used for either classification or regression. In the following example, the random forest model is used to classify the Iris species.

Loading your data

This code listing will load the iris dataset into your session:

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()

Creating an instance of the classifier

The following two lines of code create an instance of the classifier. The first line imports the random forest library. The second line creates an instance of the random forest algorithm:

>>> from sklearn.ensemble import RandomForestClassifier
>>> rf = RandomForestClassifier(n_estimators=15, random_state=111)

The n_estimators parameter in the constructor is a commonly used tuning parameter for the random forest model. The value sets the number of trees built in the forest. It's generally between 10 and 100 percent of the size of the dataset, but it depends on the data you're using. Here, the value is set at 15, which is 10 percent of the data. Later, you will see that changing the parameter value to 150 (100 percent) produces the same results.

The n_estimators parameter is used to tune model performance and overfitting. The greater the value, the better the performance, but at the cost of overfitting. The smaller the value, the lower the chance of overfitting, but at the cost of lower performance. Also, there is a point where increasing the number yields little further improvement in accuracy while dramatically increasing the computational power needed. The parameter defaults to 10 if it is omitted in the constructor.

Running the training data

You'll need to split the dataset into training and test sets before you can create an instance of the random forest classifier. The following code will accomplish that task:

>>> from sklearn import cross_validation
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.10, random_state=111)
>>> rf = rf.fit(X_train, y_train)

Line 1 imports the library that allows you to split the dataset into two parts. Line 2 calls the function from the library that splits the dataset into two parts and assigns the now-divided datasets to two pairs of variables. Line 3 takes the instance of the random forest classifier you just created and calls the fit method to train the model with the training dataset.

Running the test data

In the following code, the first line feeds the test dataset to the model, and the second line displays the output:

>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])

Evaluating the model

You can cross-reference the output from the prediction against the y_test array. As a result, you can see that it predicted two test data points incorrectly, so the accuracy of the random forest model was 86.67 percent. Here's the code:

>>> from sklearn import metrics
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
>>> y_test
array([0, 0, 2, 2, 1, 0, 0, 2, 2, 1, 2, 0, 2, 2, 2])
>>> metrics.accuracy_score(y_test, predicted)
0.8666666666666667   # 1.0 is 100 percent accuracy
>>> predicted == y_test
array([ True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True], dtype=bool)

How does the random forest model perform if you change the n_estimators parameter to 150? It looks like it won't make a difference for this small dataset. It produces the same result:

>>> rf = RandomForestClassifier(n_estimators=150, random_state=111)
>>> rf = rf.fit(X_train, y_train)
>>> predicted = rf.predict(X_test)
>>> predicted
array([0, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 0, 1, 2, 2])
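If you want to see for yourself how sensitive (or insensitive) the accuracy is to n_estimators, here is a minimal sketch of the same comparison using the current scikit-learn API (model_selection.train_test_split replaced the deprecated cross_validation module used in the listings above); exact scores may vary by scikit-learn version.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.10, random_state=111)

# Train a forest for each candidate number of trees and report test accuracy
for n in (15, 50, 150):
    rf = RandomForestClassifier(n_estimators=n, random_state=111).fit(X_train, y_train)
    print(n, "trees:", accuracy_score(y_test, rf.predict(X_test)))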
