To understand big data workflows, you have to understand what a process is and how it relates to the workflow in data-intensive environments. Processes tend to be designed as high-level, end-to-end structures useful for decision making and for normalizing how things get done in a company or organization.
In contrast, workflows are task-oriented and often require more specific data than processes. A process consists of one or more workflows relevant to its overall objective.
In many ways, big data workflows are similar to standard workflows. In fact, in any workflow, data is necessary in the various phases to accomplish the tasks. Consider the workflow in a healthcare situation.
One elementary workflow is “drawing blood.” Drawing blood is a necessary task required to complete the overall diagnostic process. If blood has not been drawn, or the data from that blood test has been lost, the veracity or truthfulness of the overall activity is directly affected.
What happens when you introduce a workflow that depends on a big data source? Although you might be able to use existing workflows, you cannot assume that a process or workflow will work correctly if you simply substitute a big data source for a standard source. The substitution may fail because standard data-processing methods lack the processing approaches or performance needed to handle the complexity of big data.
The healthcare example focuses on the need to conduct an analysis after the blood is drawn from the patient. In the standard data workflow, the blood is typed and then certain chemical tests are performed based on the requirements of the healthcare practitioner.
It is unlikely that this workflow understands the testing required for identifying specific biomarkers or genetic mutations. If you supplied big data sources for biomarkers and mutations, the workflow would fail. The workflow is not big data aware and would need to be modified or rewritten to support big data.
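The failure described here can be illustrated with a minimal sketch. The test kinds, the function name, and the record layout are all hypothetical, chosen only to show a workflow that rejects input it was never designed for.

```python
# Hypothetical test kinds the standard workflow was designed for.
SUPPORTED_TESTS = {"blood_type", "chemistry"}


def run_standard_workflow(sample):
    """Process one sample dict; raise if the test kind is unknown."""
    kind = sample["test"]
    if kind not in SUPPORTED_TESTS:
        # A biomarker or mutation record ends up here.
        raise ValueError(f"workflow is not aware of test kind: {kind!r}")
    return f"{kind} completed for patient {sample['patient_id']}"
```

Passing a record whose test kind is, say, "genetic_mutation" raises an error instead of producing a result, which is the situation that calls for modifying or rewriting the workflow.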
The best practice for understanding workflows and the effect of big data is to do the following:
1. Identify the big data sources you need to use.
2. Map the big data types to your workflow data types.
3. Ensure that you have the processing speed and storage access to support your workflow.
4. Select the data store best suited to the data types.
5. Modify the existing workflow to accommodate big data or create a new big data workflow.
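The type-mapping step might look something like the following sketch. The type names in TYPE_MAP, the SourceField class, and the map_fields helper are assumptions made for illustration; a real mapping depends on the sources and the workflow involved.

```python
from dataclasses import dataclass


@dataclass
class SourceField:
    name: str
    source_type: str  # type as reported by the (hypothetical) big data source


# Assumed mapping from source types to types the workflow already handles.
TYPE_MAP = {
    "varchar": "text",
    "int64": "integer",
    "double": "decimal",
    "json": "document",
}


def map_fields(fields):
    """Split fields into those the workflow can accept and those it cannot."""
    mapped, unmapped = {}, []
    for field in fields:
        target = TYPE_MAP.get(field.source_type)
        if target is None:
            unmapped.append(field.name)
        else:
            mapped[field.name] = target
    return mapped, unmapped
```

Fields that come back in the unmapped list are a signal that the existing workflow needs to be modified, or that a new big data workflow is required.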
After you have your big data workflows, you will need to fine-tune them so they won’t overwhelm or contaminate your analysis. For example, many big data sources do not include well-defined data definitions and metadata about the elements of those sources. Sometimes, these data sources have not been cleaned. You need to make sure you have the right level of knowledge about the sources you are going to use.
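One way to build that knowledge is to profile a sample of records before they reach the analysis. The sketch below is only an illustration: the profile_records function, the field names, and the three categories it counts are assumptions, not part of any particular tool.

```python
def profile_records(records, required_fields):
    """Count records that miss required fields or hold empty values."""
    issues = {"missing_field": 0, "empty_value": 0, "clean": 0}
    for record in records:
        if any(f not in record for f in required_fields):
            issues["missing_field"] += 1
        elif any(record[f] in (None, "") for f in required_fields):
            issues["empty_value"] += 1
        else:
            issues["clean"] += 1
    return issues
```

A high share of records outside the "clean" count suggests the source needs cleaning, or better data definitions, before the workflow can rely on it.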