Big data analysis has gotten a lot of hype recently, and for good reason. You will need to know the characteristics of big data analysis if you want to be a part of this movement. Companies have long known that valuable information is out there, but until recently they have not been able to mine it. Pushing the envelope on analysis in this way is an exciting aspect of the big data analysis movement.
Companies are excited to be able to access and analyze data that they’ve been collecting, or want to gain insight from, but have not been able to manage or analyze effectively. It might involve visualizing huge amounts of disparate data, or it might involve advanced analytics streaming at you in real time. It is evolutionary in some respects and revolutionary in others.
So, what’s different when your company is pushing the envelope with big data analysis? The infrastructure supporting big data analysis is different, and algorithms have been changed to be infrastructure-aware.
Big data analysis should be viewed from two perspectives:
Decision-oriented
Action-oriented
Decision-oriented analysis is more akin to traditional business intelligence. It looks at selective subsets and representations of larger data sources and tries to apply the results to the process of making business decisions. These decisions might well result in some kind of action or process change, but the purpose of the analysis is to augment decision making.
Action-oriented analysis is used for rapid response, when a pattern emerges or specific kinds of data are detected and action is required. Taking advantage of big data through analysis and causing proactive or reactive behavior changes offers great potential for early adopters.
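To make the action-oriented idea concrete, here is a minimal sketch in Python. It assumes a stream of numeric readings and treats "a value far from the recent average" as the pattern that demands a response. The function name, window size, threshold, and sample values are all illustrative assumptions, not part of any particular product.

```python
from collections import deque


def detect_spikes(readings, window=5, threshold=5.0):
    """Yield (index, value) for readings that deviate sharply from the recent average.

    Hypothetical helper: 'window' and 'threshold' are illustrative defaults.
    """
    recent = deque(maxlen=window)  # sliding window of the latest readings
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(value - baseline) > threshold:
                # In a real system, this is where an action would be triggered.
                yield index, value
        recent.append(value)


# Made-up sample stream with one obvious spike.
alerts = list(detect_spikes([10, 10, 11, 10, 10, 25, 10, 11]))
print(alerts)
```

The point of the sketch is the shape of the loop: data is inspected as it arrives, and the response happens at detection time rather than after a report has been reviewed.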
Finding and utilizing big data by creating analysis applications can hold the key to extracting value sooner rather than later. To accomplish this, it is often more effective to build custom applications, either from scratch or by leveraging existing platforms and components.
First, look at some of the additional characteristics of big data analysis that make it different from traditional kinds of analysis aside from the three Vs of volume, velocity, and variety:
It can be programmatic. One of the biggest changes in analysis is that in the past you were dealing with data sets you could manually load into an application and explore. With big data analysis, you often start with raw data that, because of its scale, must be handled programmatically before you can do any kind of exploration.
It can be data driven. While many data scientists use a hypothesis-driven approach to data analysis (develop a premise and collect data to see whether that premise is correct), you can also use the data to drive the analysis — especially if you’ve collected huge amounts of it. For example, you can use a machine-learning algorithm to do this kind of hypothesis-free analysis.
It can use a lot of attributes. In the past, you might have been dealing with hundreds of attributes or characteristics of a data source. Now you might be dealing with hundreds of gigabytes of data that consist of thousands of attributes and millions of observations. Everything is now happening on a larger scale.
It can be iterative. More compute power means that you can iterate on your models until you get them how you want them. Here’s an example. Assume you’re building a model that is trying to find the predictors for certain customer behaviors. You might start off by extracting a reasonable sample of data or by connecting to where the data resides. You might then build a model to test a hypothesis.
In the past you might not have had enough memory to make your model work effectively, because going through the iterations required to train the algorithm takes a tremendous amount of physical memory. It may also be necessary to use advanced computing techniques like natural language processing or neural networks that automatically evolve the model based on learning as more data is added.
It can be quick to get the compute cycles you need by leveraging cloud-based Infrastructure as a Service (IaaS). With IaaS platforms like Amazon Web Services (AWS), you can rapidly provision a cluster of machines to ingest large data sets and analyze them quickly.
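The programmatic and iterative characteristics above can be sketched together in a few lines of Python. This is only a toy illustration under stated assumptions: the data is synthetic (generated from a known line plus noise), the model is a simple straight-line fit trained with mini-batch gradient descent, and every name and parameter here is hypothetical. A real customer-behavior model would use real data and most likely a dedicated library.

```python
import random


def read_in_batches(rows, batch_size):
    """Yield successive batches so the full data set is never held in memory at once."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def fit_line(make_rows, epochs=100, batch_size=50, learning_rate=0.1):
    """Iteratively fit y = w * x + b with mini-batch gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):  # each pass over the data refines the model
        for batch in read_in_batches(make_rows(), batch_size):
            n = len(batch)
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


def synthetic_rows():
    """Assumed stand-in for a large raw data source: y = 3x + 1 plus small noise."""
    rng = random.Random(0)  # fixed seed so every pass sees the same rows
    for _ in range(1000):
        x = rng.uniform(0, 1)
        yield x, 3.0 * x + 1.0 + rng.gauss(0, 0.05)


w, b = fit_line(synthetic_rows)
print(f"w={w:.2f}, b={b:.2f}")
```

Because the data is consumed through a generator, the same loop would work whether the rows come from memory, a file, or a remote source. Each additional pass is one more iteration on the model, which is where extra compute power pays off.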