Data governance is important to your company no matter what your big data sources are or how they are managed. In the traditional world of data warehouses or relational database management, it is likely that your company has well-understood rules about how data needs to be protected.
For example, in the healthcare world, it is critical to keep patient data private. You may be able to store and analyze data about patients as long as names, Social Security numbers, and other personal data are masked. You have to make sure that unauthorized individuals cannot access private or restricted data.
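As a rough illustration of what masking can look like before patient data reaches analysts, here is a minimal Python sketch. The field names (name, ssn, notes, diagnosis) and the salt handling are assumptions made for this example, not part of any particular system. Note that a salted hash is pseudonymization rather than full anonymization, so it may not satisfy every regulation on its own.

```python
import hashlib
import re

# Matches U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pseudonymize(value: str, salt: str) -> str:
    """Replace an identifier with a truncated, salted SHA-256 digest.

    The same input and salt always give the same token, so records for
    one patient can still be grouped without exposing the name.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_record(record: dict, salt: str) -> dict:
    """Return a masked copy of a patient record (hypothetical schema)."""
    masked = dict(record)  # copy so the original record is not modified
    masked["name"] = pseudonymize(record["name"], salt)
    masked["ssn"] = "***-**-****"
    # Free-text fields can leak identifiers too, so scrub SSNs there as well.
    masked["notes"] = SSN_PATTERN.sub("***-**-****", record["notes"])
    return masked


patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "notes": "Patient SSN 123-45-6789 on file",
    "diagnosis": "asthma",
}
masked_patient = mask_record(patient, salt="example-salt")
```

In practice the salt would be kept secret and stored separately from the masked data. Fields that are needed for analysis, such as the diagnosis, pass through unchanged.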
What happens when you flood your environment with big data from a variety of sources? Some of this data will come from commercial third-party vendors that have carefully vetted it and masked out sensitive information.
However, it is quite likely that other big data sources will be insecure and unprotected and will include a lot of personal data. During initial processing, you will probably analyze lots of data that does not turn out to be relevant to your organization, and you don’t want to invest resources to protect and govern data that you do not intend to retain.
If sensitive personal data passes across your network, you may expose your company to unanticipated compliance requirements. For data that is truly exploratory, with unknown contents, it might be safer to perform the initial analysis in a “walled” environment that is internal but segmented, or in the cloud.
Finally, after you decide that a subset of that data is going to be analyzed more deeply so that results may be incorporated into your business process, it is important to institute a process of carefully applying governance requirements to that data.
What issues should you consider when you incorporate these unvetted sources into your environment? Consider the following:
Determine beforehand who is allowed to access new data sources initially as well as after the data has been analyzed and understood.
Understand how this data will be segregated from other companies' data.
Understand what your responsibilities are when you use the data. If the data is privately owned, you have to make sure that you are adhering to contracts or rules of use. Some data may be linked to a usage contract with a vendor.
Understand where your data will be physically located. You may include data that is linked to customers or prospects in specific countries that have strict privacy requirements. You need to be aware of the details of these sources to avoid violating regulations.
Understand how your data needs to be treated if it is physically moved from one location to another. Are you going to store some of this data with a cloud provider? What type of promises will that provider offer in terms of where the data will be stored, and how well it will be secured?
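Two of the considerations above, who may access a source at each stage and where the data is physically stored, lend themselves to simple automated checks. The following Python sketch shows one possible shape for such checks. The stage names, role names, region codes, and the DataSource fields are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy: which roles may access a source at each stage.
# Access is narrow while data is still exploratory and unvetted, and
# widens only after the data has been analyzed and governance applied.
ACCESS_BY_STAGE = {
    "exploratory": {"data_steward"},
    "governed": {"data_steward", "business_analyst"},
}


@dataclass(frozen=True)
class DataSource:
    name: str
    stage: str                    # "exploratory" or "governed"
    storage_region: str           # where the data physically resides
    permitted_regions: frozenset  # regions allowed by contract or regulation


def can_access(source: DataSource, role: str) -> bool:
    """Allow access only if the role is listed for the source's stage."""
    return role in ACCESS_BY_STAGE.get(source.stage, set())


def residency_ok(source: DataSource) -> bool:
    """Check that the data is stored in a region it is permitted to be in."""
    return source.storage_region in source.permitted_regions


feed = DataSource(
    name="social_feed",
    stage="exploratory",
    storage_region="us-east",
    permitted_regions=frozenset({"eu-west"}),
)
analyst_allowed = can_access(feed, "business_analyst")
feed_residency_ok = residency_ok(feed)
```

In this example the analyst is denied access because the source is still exploratory, and the residency check fails because the data sits outside its permitted regions. Unknown stages default to no access, which is the safer choice for unvetted data.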
Just because you have created a security and governance process for your traditional data sources doesn’t mean that employees and partners will extend those rules to new data sources. You need to consider two key issues: visibility of the data and the trustworthiness of those working with the data.
Visibility: Business analysts and partners may be eager to use these new data sources, but you may not know how the data will be used and controlled. In other words, you may have limited visibility into resources that run outside of your control.
This situation is especially troublesome if you need to ensure that your provider is following compliance regulations or laws. The same concern applies when you use a cloud provider to manage that data: because cloud storage may be very inexpensive, it is easy to place data there without confirming how it is governed.
Unvetted employees: Although your company may go through an extensive background check on all of its employees, you're now trusting that no malicious insiders work in various business units outside of IT. You also have to assume that your cloud provider has diligently checked its employees.
This concern is real because, by some estimates, close to 50 percent of security breaches are caused by insiders. If your company is going to use these new data sources in a highly distributed manner, you need to have a plan to deal with inside as well as outside threats.
You have a responsibility to make sure that your new big data sources do not open your company to unanticipated threats or governance risks. It is your responsibility to have good security, governance processes, and education in place across your entire information management environment.
As with any technology life cycle, you need a process for assessing your organization's capability, and the readiness of all constituents, to follow security and governance requirements. You may already have processes for data security, privacy, and governance in place for your existing structured databases and data warehouses. These processes need to be extended for your big data implementation.