Data Aggregation

Data Science Essentials For Dummies

Summarizing data, finding totals, and calculating averages and other descriptive measures are probably not new to you. When you need your summaries in the form of new data, rather than reports, the process is called aggregation. Aggregated data can become the basis for additional calculations, be merged with other datasets, or be used in any way that other data is used.

Here’s an example of a data aggregation process. A dataset contains general information about over 160,000 parcels of real estate. This data includes a variety of land uses. What if you’d like to see the average assessed value for the land in each land-use category? Here’s how you’d do it.

You’d find the data aggregation tool in your data-mining application. You might use search to find it.

You’d add the tool to a process and connect it to a source dataset.

In the data aggregation tool, you’d choose a grouping variable. In this case, it’s the Land Use variable, C_A_CLASS.

Then you’d define the summaries you want. To get the average assessed value of the land, you’d select the variable that holds the assessments and choose the average function.

When the aggregation is executed, the result is a new dataset, with one row for each type of land use and a new variable for the calculated averages.
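If you work in code rather than a point-and-click data-mining application, the same steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the handful of parcel rows is made up, LAND_VALUE is a hypothetical name for the assessed-value variable (the text names only C_A_CLASS), and aggregate_mean is a helper written for this sketch, not a function from any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows standing in for the real parcel dataset.
# C_A_CLASS is the land-use variable named in the text; LAND_VALUE is an
# assumed name for the variable that holds the assessed land value.
parcels = [
    {"C_A_CLASS": "Residential", "LAND_VALUE": 40000},
    {"C_A_CLASS": "Residential", "LAND_VALUE": 60000},
    {"C_A_CLASS": "Commercial", "LAND_VALUE": 150000},
    {"C_A_CLASS": "Commercial", "LAND_VALUE": 250000},
    {"C_A_CLASS": "Industrial", "LAND_VALUE": 90000},
]


def aggregate_mean(rows, group_var, value_var):
    """Return a new dataset with one row per group and the group's mean."""
    # Collect the values to summarize under each value of the grouping variable.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_var]].append(row[value_var])

    # Build the aggregated dataset: one row per group, one new variable.
    return [
        {group_var: key, f"MEAN_{value_var}": mean(values)}
        for key, values in groups.items()
    ]


land_use_averages = aggregate_mean(parcels, "C_A_CLASS", "LAND_VALUE")
for row in land_use_averages:
    print(row)
```

The result is itself a dataset (a list of rows), so it can feed further calculations or be merged with other data, just like the output of an aggregation tool.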

Sooner or later, you’ll need to aggregate a whole dataset. But when you want to total or average all the data in a dataset, you may run into a problem: What’s your grouping variable? The trick is to use a variable with a constant value for the whole dataset. So, create a variable where every value is the same, and then use it as your grouping variable.
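The constant-variable trick can be sketched the same way. In this hypothetical Python example, the parcel values are made up, LAND_VALUE is an assumed variable name, and ALL is a name chosen here for the constant grouping variable. Because every row carries the same value of ALL, grouping on it produces a single group that covers the whole dataset.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows; LAND_VALUE is an assumed variable name.
parcels = [
    {"LAND_VALUE": 40000},
    {"LAND_VALUE": 60000},
    {"LAND_VALUE": 150000},
    {"LAND_VALUE": 250000},
]

# Step 1: create a variable where every value is the same.
for row in parcels:
    row["ALL"] = 1

# Step 2: use that constant variable as the grouping variable.
groups = defaultdict(list)
for row in parcels:
    groups[row["ALL"]].append(row["LAND_VALUE"])

# Step 3: summarize each group. There is only one group, so the
# aggregated dataset has a single row describing the whole dataset.
overall = [
    {
        "ALL": key,
        "TOTAL_LAND_VALUE": sum(values),
        "MEAN_LAND_VALUE": mean(values),
    }
    for key, values in groups.items()
]

print(overall)
```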

About This Article

About the book author:

Meta S. Brown helps organizations use practical data analysis to solve everyday business problems. A hands-on data miner who has tackled projects with up to $900 million at stake, she is a recognized expert in cutting-edge business analytics.