Columnar Databases in a Big Data Environment

Columnar databases can be very helpful in your big data project. Relational databases are row oriented: the data in each row of a table is stored together. In a columnar, or column-oriented, database, the data in each column is stored together instead. Although this may seem like a trivial distinction, it is the most important underlying characteristic of columnar databases.
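To make the distinction concrete, here is a minimal Python sketch that lays out the same three records both ways. The table, its column names, and the helper functions are made up for illustration; a real database stores these layouts on disk, not in Python lists.

```python
# Three records of a small, hypothetical "users" table.
rows = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Grace", "age": 45},
    {"id": 3, "name": "Linus", "age": 29},
]

def to_row_layout(records):
    """Row oriented: all values of one record sit next to each other."""
    return [tuple(record.values()) for record in records]

def to_column_layout(records):
    """Column oriented: all values of one column sit next to each other."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

row_store = to_row_layout(rows)
column_store = to_column_layout(rows)

# Averaging one column reads a single list in the column layout,
# whereas the row layout would have to walk every complete record.
average_age = sum(column_store["age"]) / len(column_store["age"])
```

In this sketch, an analytical query that needs only the ages never has to look at the ids or names, which is the practical payoff of keeping each column together.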

Adding columns is very easy, and they may be added row by row, so different rows do not have to share the same set of columns. This offers great flexibility, performance, and scalability. When your data has both volume and variety, you might want to use a columnar database. It is very adaptable; you simply continue to add columns.
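The idea of adding columns row by row can be sketched in a few lines of Python. This is a toy in-memory model, not a database client: the put helper, the row keys, and the "info:name" style column names are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Each row key maps to its own dict of column -> value.
# A column that a row never uses takes up no space for that row.
sparse_table = defaultdict(dict)

def put(row_key, column, value):
    """Store one cell; the column springs into existence for this row only."""
    sparse_table[row_key][column] = value

put("user1", "info:name", "Ada")
put("user2", "info:name", "Grace")
put("user2", "info:email", "grace@example.com")  # new column, only on user2

def columns_in_use():
    """Return every column name that appears in at least one row."""
    return sorted({col for cells in sparse_table.values() for col in cells})
```

Notice that no schema change was needed to introduce the email column, and user1 carries no empty placeholder for it.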

One of the most popular columnar databases is HBase. It is a project of the Apache Software Foundation, distributed under the Apache License, Version 2.0. HBase uses the Hadoop Distributed File System (HDFS) for its core data storage needs and can act as a source and destination for MapReduce jobs.

The design of HBase is modeled on Google’s Bigtable. Like Bigtable, an HBase table is a highly scalable, sparse, distributed, persistent, multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
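A sorted map indexed by row key, column key, and timestamp is easier to picture with a small model. The following Python sketch is an in-memory toy under stated assumptions: the SortedCellMap class and its put, get, and scan methods are hypothetical names, not the HBase API, and nothing here is distributed or persistent.

```python
import bisect

class SortedCellMap:
    """Toy map: (row key, column key, timestamp) -> uninterpreted bytes."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # the same key -> bytes

    def put(self, row, column, timestamp, value: bytes):
        key = (row, column, -timestamp)  # negated so the newest sorts first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value stored for a cell, or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._values[self._keys[i]]
        return None

    def scan(self, start_row, stop_row):
        """Yield (row, column, timestamp, value) for start_row <= row < stop_row."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1

cells = SortedCellMap()
cells.put("row2", "info:name", 1, b"Grace")
cells.put("row1", "info:name", 1, b"Ada")
cells.put("row1", "info:name", 2, b"Ada L.")

latest = cells.get("row1", "info:name")
scanned = list(cells.scan("row1", "row2"))
```

Because the keys stay sorted, reading a contiguous range of rows is a single walk through the list, and older versions of a cell remain available under their timestamps.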

When your big data implementation requires random, real-time read/write data access, HBase is a very good solution. It is often used to store results for later analytical processing.

Important characteristics of HBase include the following:

Consistency: Although not an “ACID” implementation, HBase offers strongly consistent reads and writes and is not based on an eventually consistent model. This means you can use it for high-speed requirements as long as you do not need the “extra features” offered by an RDBMS, such as full transaction support or typed columns.
Sharding: Because the data is distributed by the supporting file system, HBase offers transparent, automatic splitting and redistribution of its content.
High availability: Through the implementation of region servers, HBase supports LAN and WAN failover and recovery. At the core, there is a master server responsible for monitoring the region servers and all metadata for the cluster.
Client API: HBase offers programmatic access through a Java API.
Support for IT operations: Implementers can expose performance and other metrics through a set of built-in web pages.
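The automatic splitting described under Sharding can be illustrated with a small Python sketch. It is a simplified model under loud assumptions: the Region and ShardedTable classes are hypothetical, the split is triggered by a made-up row-count threshold, and everything lives in one process, whereas a real cluster spreads regions across region servers.

```python
import bisect

MAX_ROWS_PER_REGION = 4  # hypothetical threshold chosen for the demo

class Region:
    def __init__(self, start_key, rows=None):
        self.start_key = start_key  # inclusive lower bound of the key range
        self.rows = rows or {}      # row key -> value

class ShardedTable:
    def __init__(self):
        self.regions = [Region("")]  # one region covers every key at first

    def _region_index(self, row_key):
        """Find the region whose key range contains row_key."""
        starts = [region.start_key for region in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        index = self._region_index(row_key)
        region = self.regions[index]
        region.rows[row_key] = value
        if len(region.rows) > MAX_ROWS_PER_REGION:
            self._split(index)

    def _split(self, index):
        """Move the upper half of an oversized region into a new region."""
        region = self.regions[index]
        keys = sorted(region.rows)
        middle = keys[len(keys) // 2]
        upper_rows = {k: region.rows.pop(k) for k in keys if k >= middle}
        self.regions.insert(index + 1, Region(middle, upper_rows))

    def get(self, row_key):
        return self.regions[self._region_index(row_key)].rows.get(row_key)

sharded = ShardedTable()
for key in ["a", "b", "c", "d", "e", "f"]:
    sharded.put(key, key.upper())

region_starts = [region.start_key for region in sharded.regions]
```

The caller only ever uses put and get; the split happens behind the scenes, which is what "transparent" means in this context.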