To make really huge datasets useful, you must transform them in some manner. Again, depending on whom you talk to, transformation can take on all sorts of meanings. The one meaning that isn’t discussed here is the modification of data such that it implies one thing when it actually said something else at the outset (think of this as spin doctoring the data). In fact, it’s a good idea to avoid this sort of data manipulation entirely because you can end up with completely unpredictable results when performing analysis, even if those results initially look promising and even say what you feel they should say.Types of Data Manipulation Used in Functional Programming
Another kind of data transformation actually does something worthwhile. In this case, the meaning of the data doesn’t change; only the presentation of the data changes. You can separate this kind of data transformation into a number of methods that include (but aren’t necessarily limited to) tasks such as the following:- Cleaning: As with anything else, data gets dirty. You may find that some of it is missing information and some of it may actually be correct but outdated. In fact, data becomes dirty in many ways, and you always need to clean it before you can use it. Machine Learning For Dummies, by John Paul Mueller and Luca Massaron (Wiley), discusses the topic of cleaning in considerable detail.
- Verification: Establishing that data is clean doesn’t mean that the data is correct. A dataset may contain many entries that seem correct but really aren’t. For example, a birthday may be in the right form and appear to be correct until you determine that the person in question is more than 200 years old. A part number may appear in the correct form, but after checking, you find that your organization never produced a part with that number. The act of verification helps ensure the veracity of any analysis you perform and generates fewer outliers to skew the results.
- Data typing: Data can appear to be correct and you can verify it as true, yet it may still not work. A significant problem with data is that the type may be incorrect or it may appear in the wrong form. For example, one dataset may use integers for a particular column (feature), while another uses floating-point values for the same column. Likewise, some datasets may use local time for dates and times, while others might use GMT. The transformation of the data from various datasets to match is an essential task, yet the transformation doesn’t actually change the data’s meaning.
- Form: Datasets come with many form issues. For example, one dataset may use a single column for people’s names, while another might use three columns (first, middle, and last), and another might use five columns (prefix, first, middle, last, and suffix). The three datasets are correct, but the form of the information is different, so a transformation is needed to make them work together.
- Range: Some data is categorical or uses specific ranges to denote certain conditions. For example, probabilities range from 0 to 1. In some cases, there isn’t an agreed-upon range. Consequently, you find data appearing in different ranges even though the data refers to the same sort of information. Transforming all the data to match the same range enables you to perform analysis by using data from multiple datasets.
- Baseline: You hear many people talk about dB when considering audio output in various scenarios. However, a decibel is simply a logarithmic ratio. Without a reference value or a baseline, determining what the dB value truly means is impossible. For audio, the dB is referenced to 1 volt (dBV). The reference is standard and therefore implied, even though few people actually know that a reference is involved. Now, imagine the chaos that would result if some people used 1 volt for a reference and others used 2 volts. dBV would become meaningless as a unit of measure. Many kinds of data form a ratio or other value that requires a reference. Transformations can adjust the reference or baseline value as needed so that the values can be compared in a meaningful way.
You can come up with many other data transformations. The point of this information is to point out that the method used determines the kind of data transformation that occurs, and you must perform certain kinds of transformations to make data useful. Applying an incorrect transformation or the correct transformation in the wrong way will result in useless output even when the data itself is correct.