Big Data For Dummies

Overview

Dig into data!

Big Data For Dummies will improve your big data knowledge, from the ground floor—basics like what big data is and why you should care about it—all the way to more advanced concepts like implementing big data solutions, securing and storing data, and presenting big data. In a world where data informs almost everything, it’s time to transform your business. Let this book be your guide!

Updates to the latest edition of Big Data For Dummies include a bigger focus on cloud computing (yes, that’s big data floating around in the cloud), innovative ways that organizations use data for decision-making, and the ethics of collecting and using customer data. For IT and business professionals alike, this is the complete guide to managing massive amounts of big data.

  • Discover why big data is taking over the business world, and how it can help you get to know your customers better
  • Make sense of the big data that your business generates daily—and realize an average cost reduction of 10% while you’re at it!
  • Learn where to find your data, how to organize it, and what it can teach you
  • Secure your data and your customers’ privacy with ethical analytics

The amount of information out there is vast and coming in faster than ever. When you’re up to the task of wrangling data, you and your organization stand to reap some awesome benefits. Big Data For Dummies shows you how.


About The Author

Judith Hurwitz is an expert in cloud computing, information management, and business strategy.

Alan Nugent has extensive experience in cloud-based big data solutions.

Dr. Fern Halper specializes in big data and analytics.

Marcia Kaufman specializes in cloud infrastructure, information management, and analytics.

Sample Chapters

Big Data For Dummies Cheat Sheet

To stay competitive today, companies must find practical ways to deal with big data — that is, to learn new ways to capture and analyze growing amounts of information about customers, products, and services. Data is becoming increasingly complex in structured and unstructured ways. New sources of data come from machines, such as sensors; social business sites; and website interaction, such as click-stream data.

Articles from the Book

In general, text analytics solutions for big data use a combination of statistical and Natural Language Processing (NLP) techniques to extract information from unstructured data. NLP is a broad and complex field that has developed over the last 20 years. A primary goal of NLP is to derive meaning from text. Natural Language Processing generally makes use of linguistic concepts such as grammatical structures and parts of speech.
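As a rough illustration of that idea, here is a minimal Python sketch (not from the book) that turns a piece of unstructured text into something structured by tokenizing it and counting the most frequent content words; a production NLP pipeline would add part-of-speech tagging, parsing, and entity extraction. The sample review text and the stop-word list are invented for the example.

```python
import re
from collections import Counter

def extract_keywords(text, stop_words=None, top_n=5):
    """Tokenize free text and return the most frequent content words."""
    stop_words = stop_words or {"the", "a", "an", "and", "or", "of", "to",
                                "is", "was", "my", "never"}
    tokens = re.findall(r"[a-z']+", text.lower())          # crude tokenization
    content = [t for t in tokens if t not in stop_words]   # drop function words
    return Counter(content).most_common(top_n)

review = "The checkout process was slow and the support team never answered my call."
print(extract_keywords(review))
# e.g. [('checkout', 1), ('process', 1), ('slow', 1), ('support', 1), ('team', 1)]
```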
Big data is all about high velocity, large volumes, and wide data variety, so the physical infrastructure will literally "make or break" the implementation. Most big data implementations need to be highly available, so the networks, servers, and physical storage must be resilient and redundant. Resiliency and redundancy are interrelated.
Many companies are exploring big data problems and coming up with some innovative solutions. Now is the time to pay attention to some best practices, or basic principles, that will serve you well as you begin your big data journey. In reality, big data integration fits into the overall process of integration of data across your company.
You will find value in bringing the capabilities of the data warehouse and the big data environment together. You need to create a hybrid environment where big data can work hand in hand with the data warehouse. First it is important to recognize that the data warehouse as it is designed today will not change in the short term.
A number of vendors on the market today offer big data solutions to meet this growing need for your business. Here is a listing of a few solutions that you may find interesting: IBM is taking an enterprise approach to big data and integrating across the platform, including embedding/bundling its analytics. Its products include a warehouse (InfoSphere Warehouse) that has its own built-in data-mining and cubing capability.
Custom and third-party applications offer an alternative method of sharing and examining big data sources. Although all the layers of the reference architecture are important in their own right, this layer is where most of the innovation and creativity is evident. These applications are either horizontal, in that they address problems that are common across industries, or vertical, in that they are intended to help solve an industry-specific problem.
A number of cloud delivery models exist for big data. Try talking to those with experience to figure out which type of delivery model is best for your big data initiative. Infrastructure as a Service (IaaS) is one of the most straightforward of the cloud computing services. IaaS is the delivery of computing services including hardware, networking, storage, and data center space based on a rental model.
Two key cloud models are important in the discussion of big data — public clouds and private clouds. Cloud computing is a method of providing a set of shared computing resources that include applications, computing, storage, networking, development, and deployment platforms, as well as business processes. Cloud computing turns traditional siloed computing assets into shared pools of resources.
Cloud providers come in all shapes and sizes and offer many different products for big data. Some are household names while others are recently emerging. Some of the cloud providers that offer IaaS services that can be used for big data include Amazon.com, AT&T, GoGrid, Joyent, Rackspace, IBM, and Verizon/Terremark.
Four stages are part of the planning process that applies to big data. As more businesses begin to use the cloud as a way to deploy new and innovative services to customers, the role of data analysis will explode. Therefore, consider another part of your planning process and add three more stages to your data cycle.
You will find lots of resources that can help you start making sense of the big data world. Standard organizations are tackling some of the key emerging issues with getting data resources to work together effectively. Open source offerings can help you experiment easily so that you can better understand what is possible with big data.
Reducing energy consumption, finding new sources of renewable energy, and increasing energy efficiency are all important big data goals for protecting the environment and sustaining economic growth. Large volumes of data in motion are increasingly being monitored and analyzed in real time to help achieve these goals.
Big data is of enormous significance to the healthcare industry — including its use in everything from genetic research to advanced medical imaging and research on improving quality of care. While conducting big data analysis in each of these areas is significant in furthering research, a major benefit is applying this information to clinical medicine.
Almost every area of a city has the capability to use big data, whether in the form of taxes, sensors on buildings and bridges, traffic pattern monitoring, location data, and data about criminal activity. Creating workable policies that make cities safer, more efficient, and more desirable places to live and work requires the collection and analysis of huge amounts of data from a variety of sources.
Big data research can help in the business world, but it also has an environmental purpose. Scientists measure and monitor various attributes of lakes, rivers, oceans, seas, wells, and other water environments to support environmental research. Important research on water conservation and sustainability depends on tracking and understanding underwater environments and knowing how they change.
Virtualization is ideal for big data because it separates resources and services from the underlying physical delivery environment, enabling you to create many virtual systems within a single physical system. One of the primary reasons that companies have implemented virtualization is to improve the performance and efficiency of processing of a diverse mix of workloads.
Big data requires a consistent approach to Web and content management. It’s no secret that most data available in the world today is unstructured. Paradoxically, companies have focused their investments in the systems with structured data that were most closely associated with revenue: line-of-business transactional systems.
To understand big data workflows, you have to understand what a process is and how it relates to the workflow in data-intensive environments. Processes tend to be designed as high level, end-to-end structures useful for decision making and normalizing how things get done in a company or organization. In contrast, workflows are task-oriented and often require more specific data than processes.
The term polyglot is borrowed and redefined for big data as a set of applications that use several core database technologies, and this is the most likely outcome of your implementation planning. The official definition of polyglot is “someone who speaks or writes several languages.” It is going to be difficult to choose one persistence style no matter how narrow your approach to big data might be.
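To make the polyglot idea concrete, the sketch below (a hypothetical illustration, not any vendor's API) routes different kinds of records to different persistence styles. The store classes, record shapes, and field names are all invented.

```python
# Hypothetical polyglot-persistence sketch: route each kind of record to the
# store best suited to it. The three store classes are stand-ins, not drivers
# for any real database product.
class KeyValueStore:
    def __init__(self): self.data = {}
    def put(self, key, value): self.data[key] = value

class DocumentStore:
    def __init__(self): self.docs = []
    def insert(self, doc): self.docs.append(doc)

class GraphStore:
    def __init__(self): self.edges = []
    def relate(self, a, rel, b): self.edges.append((a, rel, b))

def persist(record, kv, docs, graph):
    """Pick a persistence engine based on the shape of the record."""
    if record["kind"] == "session":
        kv.put(record["key"], record["payload"])
    elif record["kind"] == "document":
        docs.insert(record["payload"])
    elif record["kind"] == "relationship":
        graph.relate(record["from"], record["rel"], record["to"])

kv, docs, graph = KeyValueStore(), DocumentStore(), GraphStore()
persist({"kind": "session", "key": "s1", "payload": {"cart": ["sku-9"]}}, kv, docs, graph)
persist({"kind": "relationship", "from": "alice", "rel": "FOLLOWS", "to": "bob"}, kv, docs, graph)
```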
MapReduce is increasingly becoming useful for big data. In the early 2000s, some engineers at Google looked into the future and determined that while their current solutions for applications such as web crawling, query frequency, and so on were adequate for most existing requirements, they were inadequate for the complexity they anticipated as the web scaled to more and more users.
Most big data management professionals are familiar with the need to manage metadata in structured database management environments. These data sources are strongly typed (for example, the first ten characters are the first name) and designed to operate with metadata. You might assume that metadata is nonexistent in unstructured data, but that is not true.
As core components, Hadoop MapReduce and HDFS are constantly being improved and provide starting points for big data, but you need something more. Trying to tackle big data challenges without a toolbox filled with technology and services is like trying to empty the ocean with a spoon. The Hadoop ecosystem provides an ever-expanding collection of tools and technologies created to smooth the development, deployment, and support of big data solutions.
Virtualized big data environments need to be adequately managed and governed to realize cost savings and efficiency benefits. If you rely on big data services to solve your analytics challenges, you need to be assured that the virtual environment is as well managed and secure as the physical environment. Some of the benefits of virtualization, including ease of provisioning, can easily lead to management and security problems without proper oversight.
Big data analysis has gotten a lot of hype recently, and for good reason. You will need to know the characteristics of big data analysis if you want to be a part of this movement. Companies know that something is out there, but until recently have not been able to mine it. Pushing the envelope on analysis is one of the most exciting aspects of the big data analysis movement.
Even though new sets of tools continue to be available to help you manage and analyze your big data framework more effectively, you may not be able to get what you need. In addition, a range of technologies can support big data analysis and requirements such as availability, scalability, and high performance. Some of these include big data appliances, columnar databases, in-memory databases, nonrelational databases, and massively parallel processing engines.
Existing analytics tools and techniques will be very helpful in making sense of big data. The algorithms that are part of these tools, however, must be able to work with large amounts of potentially real-time and disparate data. A competent infrastructure must be in place to support this. And, vendors providing analytics tools will also need to ensure that their algorithms work across distributed implementations.
Columnar databases can be very helpful in your big data project. Relational databases are row oriented: the data in each row of a table is stored together. In a columnar, or column-oriented, database, the data is stored by column, so the values of a single column across many rows are stored together. Although this may seem like a trivial distinction, it is the most important underlying characteristic of columnar databases.
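A small Python sketch can make that distinction concrete; the customer table below is made up, and the lists and dictionaries simply stand in for the storage layouts a real row store or column store would manage on disk.

```python
# Toy contrast between row-oriented and column-oriented storage of one table.
rows = [
    {"id": 1, "name": "Ann", "spend": 120.0},
    {"id": 2, "name": "Bo",  "spend": 75.5},
    {"id": 3, "name": "Cy",  "spend": 240.0},
]

# Row store: each record's values live together (good for fetching whole records).
row_store = rows

# Column store: each column's values live together (good for scanning one column).
column_store = {
    "id":    [r["id"] for r in rows],
    "name":  [r["name"] for r in rows],
    "spend": [r["spend"] for r in rows],
}

# An aggregate over one column touches only that column in the column store.
total_spend = sum(column_store["spend"])
print(total_spend)  # 435.5
```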
Is big data really new or is it an evolution in the data management journey? It is actually both. As with other waves in data management, big data is built on top of the evolution of data management practices over the past five decades. What is new is that for the first time, the cost of computing cycles and storage has reached a tipping point.
Data mining involves exploring and analyzing large amounts of data to find patterns in big data. The techniques came out of the fields of statistics and artificial intelligence (AI), with a bit of database management thrown into the mix. Generally, the goal of data mining is either classification or prediction.
Some big data experts believe that different kinds of data require different forms of protection and that, in some cases in a cloud environment, data encryption might, in fact, be overkill. You could encrypt everything. You could encrypt data, for example, when you write it to your own hard drive, when you send it to a cloud provider, and when you store it in a cloud provider's database.
Big data enables organizations to store, manage, and manipulate vast amounts of disparate data at the right speed and at the right time. To gain the right insights, big data is typically broken down by three characteristics: volume (how much data), velocity (how fast the data is processed), and variety (the various types of data). While it is convenient to simplify big data into the three Vs, it can be misleading and overly simplistic.
In many cases, big data analysis will be represented to the end user through reports and visualizations. Because the raw data can be incomprehensibly varied, you will have to rely on analysis tools and techniques to help present the data in meaningful ways. New applications are becoming available and will fall broadly into two categories: custom or semi-custom.
If your company is considering a big data project, it’s important that you understand some distributed computing basics first. There isn’t a single distributed computing model because computing resources can be distributed in many ways. For example, you can distribute a set of programs on the same physical server and use messaging services to enable them to communicate and pass information.
Enterprise Data Management (EDM) is an important process in big data for understanding and controlling the economics of data in your enterprise or organization. Although EDM is not required for big data, the proper application of EDM will help to ensure better integration, control, and usability of big data. EDM is a comprehensive approach to defining, governing, securing, and maintaining the quality of all data involved in the business processes of an organization.
To understand big data, it helps to see how it stacks up — that is, to lay out the components of the architecture. A big data management architecture must include a variety of services that enable companies to make use of myriad data sources in a fast and effective manner. Here's a closer look at the components and the relationships between them: Interfaces and feeds: On either side of the architecture are interfaces and feeds into and out of both internally managed data and data feeds from external sources.
Big data is only in the first stages, but it is never too early to get started with best practices. As with every important upcoming technology, it is important to have a strategy in place and know where you’re headed. Establish a big data road map: At this stage, you have experimented with big data and determined your company’s goals and objectives.
While big data is only in the first stages, you want to plan for success. It is never too early to get started with planning and good practices so that you can leverage what you are learning and the experience you are gaining. Plan your big data goals: Many organizations start their big data journey by experimenting with a single project that might provide some concrete benefit.
What does the business plan hope to achieve by leveraging big data? This is not an easy question to answer. Different companies in different industries need to manage their data differently. But some common business issues are at the center of the way that big data is being considered as a way to both plan and execute for business strategy.
When people talk of map and reduce in big data, they do so as operations within a functional programming model. Functional programming is one of the two ways that software developers create programs to address business problems. The other model is procedural programming. Take a quick look to understand the differences and to see when it's best to use one or the other model.
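Here is one way to picture that difference, as a hedged Python sketch rather than anything from the book: the same order total computed procedurally with an explicit loop and functionally with map and reduce. The order data is invented.

```python
from functools import reduce

# Procedural style: spell out the loop and mutate an accumulator step by step.
def total_sales_procedural(orders):
    total = 0.0
    for order in orders:
        total += order["amount"]
    return total

# Functional style: describe the transformation with map and reduce instead of a loop.
def total_sales_functional(orders):
    amounts = map(lambda order: order["amount"], orders)
    return reduce(lambda running, amount: running + amount, amounts, 0.0)

orders = [{"amount": 19.99}, {"amount": 5.00}, {"amount": 42.50}]
assert total_sales_procedural(orders) == total_sales_functional(orders)
```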
The fundamental structure for graph databases in big data is called “node-relationship.” This structure is most useful when you must deal with highly interconnected data. Nodes and relationships support properties, a key-value pair where the data is stored. These databases are navigated by following the relationships.
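The sketch below is a toy, in-memory illustration of that node-relationship structure rather than the API of any real graph database; the customer and company nodes, their properties, and the PURCHASED_FROM relationship are all invented.

```python
# Minimal node-relationship sketch: nodes and relationships both carry
# properties as key-value pairs, and queries navigate by following relationships.
nodes = {
    "alice": {"label": "Customer", "city": "Austin"},
    "acme":  {"label": "Company",  "industry": "Retail"},
}
relationships = [
    {"from": "alice", "to": "acme", "type": "PURCHASED_FROM", "since": 2021},
]

def neighbors(node_id, rel_type):
    """Follow outgoing relationships of a given type from one node."""
    return [r["to"] for r in relationships
            if r["from"] == node_id and r["type"] == rel_type]

print(neighbors("alice", "PURCHASED_FROM"))  # ['acme']
```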
The Hadoop Distributed File System is a versatile, resilient, clustered approach to managing files in a big data environment. HDFS is not the final destination for files. Rather, it is a data service that offers a unique set of capabilities needed when data volumes and velocity are high. Because the data is written once and then read many times thereafter, rather than the constant read-writes of other file systems, HDFS is an excellent choice for supporting big data analysis.
To fully understand the capabilities of Hadoop MapReduce, it’s important to differentiate between MapReduce (the algorithm) and an implementation of MapReduce. Hadoop MapReduce is an implementation of the algorithm developed and maintained by the Apache Hadoop project. It is helpful to think about this implementation as a MapReduce engine, because that is exactly how it works.
The power and flexibility of Hadoop for big data are immediately visible to software developers primarily because the Hadoop ecosystem was built by developers, for developers. However, not everyone is a software developer. Pig was designed to make Hadoop more approachable and usable by nondevelopers. Pig is an interactive, or script-based, execution environment supporting Pig Latin, a language used to express data flows.
Sqoop (SQL-to-Hadoop) is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL, for Extract, Transform, and Load. While getting data into Hadoop is critical for processing using MapReduce, it is also critical to get data out of Hadoop and into an external data source for use in other kinds of application.
Hadoop’s greatest technique for addressing big data challenges is its capability to divide and conquer with Zookeeper. After the problem has been divided, the conquering relies on the capability to employ distributed and parallel processing techniques across the Hadoop cluster. For some big data problems, the interactive tools are unable to provide the insights or timeliness required to make business decisions.
One benefit of your big data analytics can be fraud prevention. By many estimates, at least 10 percent of insurance company payments are for fraudulent claims, and the global sum of these fraudulent payments amounts to billions or possibly trillions of dollars. While insurance fraud is not a new problem, the severity of the problem is increasing and perpetrators of insurance fraud are becoming increasingly sophisticated.
Big data is most useful if you can do something with it, but how do you analyze it? Companies like Amazon and Google are masters at analyzing big data. And they use the resulting knowledge to gain a competitive advantage. Just think about Amazon's recommendation engine. The company takes all your buying history together with what it knows about you, your buying patterns, and the buying patterns of people like you to come up with some pretty good suggestions.
Big data implementation plans, or road maps, will be different depending on your business goals, the maturity of your data management environment, and the amount of risk your organization can absorb. So, begin your planning by taking into account all the issues that will allow you to determine an implementation road map.
A thoughtful and well-governed approach to security can succeed in mitigating many security risks. You need to develop a secure big data environment, and that starts with assessing your current state. A great place to begin is by answering a set of questions that can help you form your approach to your data security strategy.
High volume, high variety, and high velocity are the essential characteristics of big data. But other characteristics of big data are equally important, especially when you apply big data to operational processes. This second set of “V” characteristics that are key to operationalizing big data includes Validity: Is the data correct and accurate for the intended usage?
Across the world, big data sources for healthcare are being created and made available for integration into existing processes. Clinical trial data, genetics and genetic mutation data, protein therapeutics data, and many other new sources of information can be harvested to improve daily healthcare processes. Social media can and will be used to augment existing data and processes to provide more personalized views of treatment and therapies.
Just having access to big data sources is not enough. You will need to integrate these sources. Soon there will be petabytes of data and hundreds of access mechanisms for you to choose from. But which streams and what kinds of data do you need? Understand the problem you are trying to solve, identify the processes involved, identify the information required to solve the problem, and then gather the data, process it, and analyze the results. This process may sound familiar because businesses have been doing a variation of this algorithm for decades.
Clearly, the very nature of the cloud makes it an ideal computing environment for big data. So how might you use big data together with the cloud? Here are some examples: IaaS in a public cloud: In this scenario, you would be using a public cloud provider’s infrastructure for your big data services because you don’t want to use your own physical infrastructure.
Getting the right perspective on data quality can be very challenging in the world of big data. With the majority of big data sources, you need to assume that you are working with data that is not clean. In fact, the overwhelming abundance of seemingly random and disconnected data in streams of social media data is one of the things that make it so useful to businesses.
It is important to lay a strong architectural foundation if you want to be successful with big data. In addition to supporting the functional requirements, it is important to support the required performance. Your needs will depend on the nature of the analysis you are supporting. You will need the right amount of computational power and speed.
Once you gather your big data, what is your next step? Today customer loyalty is paramount because the customer is in the driver’s seat when it comes to making a choice about how to interact with a service provider. This is true across many industries. The buyer has many more channel options and is increasingly researching purchase decisions and making buying decisions from a mobile device.
Complex Event Processing (CEP) is useful for big data because it is intended to manage data in motion. Complex Event Processing is a technique for tracking, analyzing, and processing data as an event happens. This information is then processed and communicated based on business rules and processes. The idea behind CEP is to be able to establish the correlation between streams of information and match the resulting pattern with defined behaviors such as mitigating a threat or seizing an opportunity.
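As a minimal illustration of the idea, the following Python sketch applies one made-up CEP-style rule to an ordered event stream: flag an account when more than three failed logins arrive within a 60-second window. Real CEP engines express such rules declaratively and run them continuously over live streams.

```python
from collections import deque

def detect_bursts(events, limit=3, window=60):
    """Flag timestamps where failed logins exceed `limit` inside `window` seconds."""
    recent = deque()          # timestamps of recent failed logins
    alerts = []
    for ts, kind in events:   # events are assumed to arrive in time order
        if kind != "failed_login":
            continue
        recent.append(ts)
        while recent and ts - recent[0] > window:
            recent.popleft()  # drop events that fell out of the window
        if len(recent) > limit:
            alerts.append(ts)
    return alerts

stream = [(0, "failed_login"), (10, "page_view"), (15, "failed_login"),
          (20, "failed_login"), (25, "failed_login")]
print(detect_bursts(stream))  # [25]
```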
Sometimes, when approaching big data, companies are faced with huge amounts of data and little idea of where to go next. Enter data streaming. When a significant amount of data needs to be quickly processed in near real time to gain insights, data in motion in the form of streaming data is the best answer. What is data that is not at rest?
MapReduce is a software framework that is ideal for big data because it enables developers to write programs that can process massive amounts of unstructured data in parallel across a distributed group of processors. The map function has been a part of many functional programming languages for years.
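To show the shape of the programming model, here is a single-process word count written in MapReduce style; it is only a sketch of the idea, since a real engine such as Hadoop distributes the map and reduce phases across many machines and handles shuffling, sorting, and fault tolerance for you.

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs. Reduce phase: sum the counts per word.
def map_phase(line):
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["big data is big", "data is everywhere"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```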
Take stock of the type of data you are dealing with in your big data project. Many organizations are recognizing that a lot of internally generated data has not been used to its full potential in the past. By leveraging new tools, organizations are gaining new insight from previously untapped sources of unstructured data in e-mails, customer service records, sensor data, and security logs.
The big data that can make a difference in how companies satisfy their customers and partners is not necessarily in traditional databases any more. The value of unstructured data from nontraditional sources has become apparent. Business leaders have discovered that if they can quickly analyze information that is unstructured — either in the form of text from customer support systems or social media sites — they can gain important insights.
While the worlds of big data and the traditional data warehouse will intersect, they are unlikely to merge anytime soon. Think of a data warehouse as a system of record for business intelligence, much like a customer relationship management (CRM) or accounting system. These systems are highly structured and optimized for specific purposes.
Big data is beginning to have an important impact on business strategy. Because of the increasing importance of big data, keeping data analytics in perspective is good business practice. Companies are beginning to realize that they can begin leveraging data throughout the planning cycle rather than at the end.
By far, the simplest of the NoSQL (not-only-SQL) databases in a big data environment are those employing the key-value pair (KVP) model. KVP databases do not require a schema (unlike RDBMSs) and offer great flexibility and scalability. KVP databases do not offer ACID (Atomicity, Consistency, Isolation, Durability) capability, and they require implementers to think about data placement, replication, and fault tolerance, as these are not expressly controlled by the technology itself.
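A toy key-value store in Python illustrates why this model is both simple and demanding: the interface is nothing more than put and get by key (the class and the session data below are invented), so everything else is left to the implementer.

```python
# Minimal key-value sketch: the store only understands put/get by key, so
# concerns like replication and placement are left to the application.
class TinyKVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # no schema: any value shape is accepted

    def get(self, key, default=None):
        return self._data.get(key, default)

sessions = TinyKVStore()
sessions.put("session:42", {"user": "jlee", "cart": ["sku-1", "sku-9"]})
print(sessions.get("session:42"))
```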
At the lowest level of the big data stack is the physical infrastructure. Your company might already have a data center or have made investments in physical infrastructure, so you’re going to want to find a way to use the existing assets. Big data implementations have very specific requirements on all elements in the reference architecture, so you need to examine these requirements on a layer-by-layer basis to ensure that your implementation will perform and scale according to the demands of your business.
Security and privacy requirements, layer 1 of the big data stack, are similar to the requirements for conventional data environments. The security requirements have to be closely aligned to specific business needs. Some unique challenges arise when big data becomes part of the strategy: Data access: User access to raw or computed big data has about the same level of technical requirements as non-big data implementations.
At the core of any big data environment, and layer 2 of the big data stack, are the database engines containing the collections of data elements relevant to your business. These engines need to be fast, scalable, and rock solid. They are not all created equal, and certain big data environments will fare better with one engine than another, or more likely with a mix of database engines.
Organizing data services and tools, layer 3 of the big data stack, capture, validate, and assemble various big data elements into contextually relevant collections. Because big data is massive, techniques have evolved to process the data efficiently and seamlessly. MapReduce is one heavily used technique. Suffice it to say here that many of these organizing data services are MapReduce engines, specifically designed to optimize the organization of big data streams.
The data warehouse, layer 4 of the big data stack, and its companion the data mart, have long been the primary techniques that organizations use to optimize data to help decision makers. Typically, data warehouses and marts contain normalized data gathered from a variety of sources and assembled to facilitate analysis of the business.
Companies are swimming in big data. The problem is that they often don't know how to pragmatically use that data to be able to predict the future, execute important business processes, or simply gain new insights. The goal of your big data strategy and plan should be to find a pragmatic way to leverage data for more predictable business outcomes.
Job scheduling and tracking for big data are integral parts of Hadoop MapReduce and can be used to manage resources and applications. The early versions of Hadoop supported a rudimentary job and task tracking system, but as the mix of work supported by Hadoop changed, the scheduler could not keep up. In particular, the old scheduler could not manage non-MapReduce jobs, and it was incapable of optimizing cluster utilization.
Virtualization separates resources and services from the underlying physical delivery environment, enabling you to create many virtual systems within a single physical system. One of the primary reasons that companies have implemented virtualization is to improve the performance and efficiency of processing of a diverse mix of workloads. The big data hypervisor: In an ideal world, you don’t want to worry about the underlying operating system and the physical hardware.
Hadoop, an open-source software framework, uses HDFS (the Hadoop Distributed File System) and MapReduce to analyze big data on clusters of commodity hardware—that is, in a distributed computing environment. The Hadoop Distributed File System (HDFS) was developed to allow companies to more easily manage huge volumes of data in a simple and pragmatic way.
Hive is a batch-oriented, data-warehousing layer built on the core elements of Hadoop (HDFS and MapReduce) and is very useful in big data. It provides users who know SQL with a simple SQL-like implementation called HiveQL, without sacrificing access via mappers and reducers. With Hive, you can get the best of both worlds: SQL-like access to structured data and sophisticated big data analysis with MapReduce.
Traditional business intelligence products weren’t really designed to handle big data, so they may require some modification. They were designed to work with highly structured, well-understood data, often stored in a relational data repository and displayed on your desktop or laptop computer. This traditional business intelligence analysis is typically applied to snapshots of data rather than the entire amount of data available.
With the advent of big data, some changes can impact the way you approach business planning. As more businesses begin to use the cloud as a way to deploy new and innovative services to customers, the role of data analysis will explode. You might want to think about another part of your planning process. After you make your initial road map and strategy, you may want to add three more stages to your data cycle: monitoring, adjusting, and experimenting.
Nonrelational databases do not rely on the table/key model endemic to RDBMSs (relational database management systems). In short, specialty data in the big data world requires specialty persistence and data manipulation techniques. Although these new styles of databases offer some answers to your big data challenges, they are not an express ticket to the finish line.
Your big data architecture also needs to perform in concert with your organization’s supporting infrastructure. For example, you might be interested in running models to determine whether it is safe to drill for oil in an offshore area given real-time data of temperature, salinity, sediment resuspension, and a host of other biological, chemical, and physical properties of the water column.
Just having a faster computer isn’t enough to ensure the right level of performance to handle big data. You need to be able to distribute components of your big data service across a series of nodes. In distributed computing, a node is an element contained within a cluster of systems or within a rack. A node typically includes CPU, memory, and some kind of disk.
With the governance challenges presented by big data, it is wise and absolutely necessary to have practices in place to ensure that you are protecting your information. While the degree to which you do these will vary depending on your business, make sure you are taking necessary precautions. Audit your big data process: At the end of the day, you have to be able to demonstrate to internal and external auditors that you are meeting the rules necessary to support the operations of the business.
Text analytics can be used to help gain insight into data. So, what if the data is big data? That would mean that the unstructured data being analyzed is high volume, high velocity, or both. Big data and the voice of the customer: Optimizing the customer experience and improving customer retention are dominant drivers for many service industries.
How will you know how to put all of your data together? With a big data project, what you want to do with your structured and unstructured data indicates why you might choose one piece of technology over another one. It also determines the need to understand inbound data structures to put this data in the right place.
Typically, companies begin their journey to big data by starting with an organizational experiment to see whether big data can play an important role in defining and impacting business strategy. However, after it becomes clear that big data will have a strategic role as part of the information management environment, you have to make sure that the right structure is in place to support and protect the organization.
Big data is becoming an important element in the way organizations are leveraging high-volume data at the right speed to solve specific data problems. Relational Database Management Systems are important for this high volume. Big data does not live in isolation. To be effective, companies often need to be able to combine the results of big data analysis with the data that exists within the business.
While companies are very concerned about the security and governance of their data in general, big data initiatives come with certain complexities and unforeseen issues that many companies are not prepared to handle. Often big data analysis is conducted with a vast array of data sources that might come from many unvetted sources.
So, how do you get started in your journey to creating the right environment so that you are ready to both experiment with big data and be prepared to expand your use of big data when you are ready? Will you have to invest in new technologies for your data center? Can you leverage cloud computing services? The answer to these questions is yes.
Spatial databases can be an important tool in your big data project. Spatial data itself is standardized through the efforts of the Open Geospatial Consortium (OGC), which establishes OpenGIS (Geographic Information System) and a number of other standards for spatial data. Whether you know it or not, you may interact with spatial data every day.
HBase is a distributed, nonrelational (columnar) database that utilizes HDFS as its persistence store for big data projects. It is modeled after Google BigTable and is capable of hosting very large tables (billions of columns/rows) because it is layered on Hadoop clusters of commodity hardware. HBase provides random, real-time read/write access to big data.
The term structured data generally refers to data that has a defined length and format. Examples of structured data include numbers, dates, and groups of words and numbers called strings. Most experts agree that this kind of data accounts for about 20 percent of the data that is out there. Structured data is the data you’re probably used to dealing with.
Many companies that are beginning their exploration of big data are in the early stages of execution. Consider these do’s and don’ts as part of your strategy. Most companies are experimenting with pilots to see whether they can leverage big data sources to transform decision making. It is easy to make mistakes that can cause disruptions in your business strategy.
As you enter the world of big data, you'll need to absorb many new types of database and data-management technologies. Here are the top-ten big data trends: Hadoop is becoming the underpinning for distributed big data management. Hadoop provides a distributed file system that can be used in conjunction with MapReduce to process and analyze massive amounts of data, enabling the big data trend.
Here is an overview of some of the players in the text analysis big data market. Some are small while others are household names. Some call what they do big data text analytics, while some just refer to it as text analytics. Attensity for big data: Attensity is one of the original text analytics companies that began developing and selling products more than ten years ago.
Numerous methods exist for analyzing unstructured data for your big data initiative. Historically, these techniques came out of technical areas such as Natural Language Processing (NLP), knowledge discovery, data mining, information retrieval, and statistics. Text analytics is the process of analyzing unstructured text, extracting relevant information, and transforming it into structured information that can then be leveraged in various ways.
Data governance is important to your company no matter what your big data sources are or how they are managed. In the traditional world of data warehouses or relational database management, it is likely that your company has well-understood rules about how data needs to be protected. For example, in the healthcare world, it is critical to keep patient data private.
You’ll find a nuance about big data analysis. It’s really about small data. While this may seem confusing and counter to the whole premise, small data is the product of big data analysis. This is not a new concept, nor is it unfamiliar to people who have been doing data analysis for any length of time. The overall working space is larger, but the answers lie somewhere in the “small.”
Cloud computing is a method of providing a set of shared computing resources and is becoming increasingly important for your big data initiative. The cloud includes applications, computing, storage, networking, development, and deployment platforms, as well as business processes. Cloud computing turns traditional siloed computing assets into shared pools of resources based on an underlying Internet foundation.
As computing moved into the commercial market, data was stored in flat files that imposed no structure. Today, big data requires manageable data structures. When companies needed to get to a level of detailed understanding about customers, they had to apply brute-force methods, including very detailed programming models to create some value.
The best way to understand the economics of big data is to look at the various methods for putting big data to work for your organization. While specific costs may vary due to the size of your organization, its purchasing power, vendor relationships, and so on, the classes of expense are fairly consistent. Big data types and sources: The most important decisions you need to make with respect to types and sources are: What data will be necessary to address your business problem?
With the advent of big data, the deployment models for managing data are changing. The traditional data warehouse is typically implemented on a single, large system within the data center. The costs of this model have led organizations to optimize these warehouses and limit the scope and size of the data being managed.
Behind all the important trends over the past decade, including service orientation, cloud computing, virtualization, and big data, is a foundational technology called distributed computing. Simply put, without distributing computing, none of these advancements would be possible. Distributed computing is a technique that allows individual computers to be networked together across geographical areas as though they were a single environment.
The fundamental elements of the big data platform manage data in new ways as compared to the traditional relational database. This is because of the need to have the scalability and high performance required to manage both structured and unstructured data. Components of the big data ecosystem ranging from Hadoop to NoSQL DB, MongoDB, Cassandra, and HBase all have their own approach for extracting and loading data.
The data warehouse market has indeed begun to change and evolve with the advent of big data. In the past, it was simply not economical for companies to store the massive amount of data from a large number of systems of record. The lack of cost-effective and practical distributed computing architectures meant that a data warehouse was designed so it could be optimized to operate on a single unified system.
Both streaming data and Complex Event Processing have an enormous impact on how companies can make strategic use of big data. With streaming data, companies are able to process and analyze this data in real time to gain an immediate insight. It often requires a two-step process to continue to analyze the key findings that might have gone unnoticed in the past.
Solving big data challenges requires the management of large volumes of highly distributed data stores along with the use of compute- and data-intensive applications. Virtualization provides the added level of efficiency to make big data platforms a reality. Although virtualization is technically not a requirement for big data analysis, software frameworks are more efficient in a virtualized environment.
ETL tools combine three important functions (extract, transform, load) required to get data from one big data environment and put it into another data environment. Traditionally, ETL has been used with batch processing in data warehouse environments. Data warehouses provide business users with a way to consolidate information to analyze and report on data relevant to their business focus.
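For a feel of what those three steps look like in code, here is a minimal, hypothetical ETL pass in Python that extracts rows from a small CSV feed, normalizes them, and loads them into an in-memory SQLite table; the feed, schema, and field names are made up, and real ETL tools add scheduling, error handling, and much larger-scale data movement.

```python
import csv, io, sqlite3

def etl(csv_text, conn):
    """Extract rows from a CSV feed, transform them, and load them into SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, region TEXT, amount REAL)")
    reader = csv.DictReader(io.StringIO(csv_text))                  # extract
    rows = [(int(r["id"]), r["region"].strip().upper(), float(r["amount"]))
            for r in reader]                                        # transform
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)   # load
    conn.commit()

feed = "id,region,amount\n1, east ,19.99\n2, west ,5.00\n"
db = sqlite3.connect(":memory:")
etl(feed, db)
print(db.execute("SELECT region, SUM(amount) FROM orders GROUP BY region").fetchall())
# [('EAST', 19.99), ('WEST', 5.0)]
```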
Knowing what data is stored and where it is stored are critical building blocks in your big data implementation. It's unlikely that you'll use RDBMSs for the core of the implementation, but it's very likely that you'll need to rely on the data stored in RDBMSs to create the highest level of value to the business with big data.
A primary consideration when undertaking a big data project is the projected amount of real-time and non-real-time data required to carry out your initiative. Big data is often about doing things that weren’t possible because the technology was not advanced enough or the cost was prohibitive. The big change happening with big data is the capability to leverage massive amounts of data without all the complex programming required in the past.
What does your business now do with all the data in all its forms? Big data requires many different approaches to analysis, traditional or advanced, depending on the problem being solved. Some analyses will use a traditional data warehouse, while other analyses will take advantage of advanced predictive analytics.
Unstructured data is different than structured data in that its structure is unpredictable. Examples of unstructured data include documents, e-mails, blogs, digital images, videos, and satellite imagery. It also includes some data generated by machines or sensors. In fact, unstructured data accounts for the majority of data that's on your company's premises as well as external to your company in online private and public sources such as Twitter and Facebook.
Unstructured data is data that does not follow a specified format. If 20 percent of the data available to enterprises is structured data, the other 80 percent is unstructured. Unstructured data is really most of the data that you will encounter. Until recently, however, the technology didn’t really support doing much with it except storing it or analyzing it manually.
Warning! Cloud-based services can provide an economical solution to your big data needs, but the cloud has its issues. It’s important to do your homework before moving your big data there. Here are some issues to consider: Data integrity: You need to make sure that your provider has the right controls in place to ensure that the integrity of your data is maintained.
Big data is not a single technology but a combination of old and new technologies that helps companies gain actionable insight. Therefore, big data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction. Big data is typically broken down by three characteristics: volume (how much data), velocity (how fast that data is processed), and variety (the various types of data). Although it’s convenient to simplify big data into the three Vs, it can be misleading and overly simplistic.
With the current state of big data, you need data in motion if you want to react quickly. To complete a credit card transaction or send an e-mail, data needs to be transported from one location to another. Data is at rest when it is stored in a database in your data center or the cloud. In contrast, data is in motion when it is in transit from one resting location to another.
Search engine innovators like Yahoo! and Google were faced with a big data problem. They needed to find a way to make sense of the massive amounts of data that their engines were collecting. These companies needed to understand both what information they were gathering and how they could monetize that data to support their business model.
Numerous combinations of deployment and delivery models exist for big data in the cloud. For example, you can utilize a public cloud IaaS or a private cloud IaaS. So, what does this mean for big data and why is the cloud a good fit for it? Well, big data requires distributed clusters of compute power, which is how the cloud is architected.
