Data preparation in SPSS Statistics
Data preparation is an integral part of every research project and is often the most time-consuming activity in a project. Different projects will require different types of data preparation, so there is no prescribed sequence in which data preparation tasks should be undertaken.
The following table lists some of the most common data preparation tasks, along with the SPSS submenu that helps with each activity.
Data and Transform Menu Procedures
Activity | Submenu(s) | Useful For |
Selecting a subset of cases | Select Cases or Split File | Running an analysis on only a portion of the data (such as customers who live in a particular region) |
Identifying unusual cases | Identify Unusual Cases or Sort Cases | Sorting cases in ascending or descending order based on the values of one or more variables to view extreme cases |
Removing duplicate cases | Identify Duplicate Cases | Identifying an individual who appears several times in the same dataset |
Recoding data values | Recode into Different Variables or Recode into Same Variables (not recommended) | Collapsing a 7-point customer satisfaction scale into three responses (negative, neutral, or positive) after data inspection |
Combining data files | Merge Files > Add Cases or Merge Files > Add Variables | Combining data that is kept in different locations and must be brought together before data analysis can begin |
Creating new variables | Compute Variable | Extracting additional information or insight from the variables originally in the dataset |
Counting occurrences | Count Values within Cases | Counting how often something of interest occurs |
Calculating with date and time variables | Date and Time Wizard | Calculating the amount of time that has passed between time points |
Transforming string to numeric values | Automatic Recode | Modifying string variables so they can be used in more analyses |
Creating groups from continuous data | Visual Binning | Creating groups out of scale variables (income groups from income) |
Calculating summaries across cases | Aggregate | Creating the appropriate level of analysis for the data (taking transactional data so it can be analyzed at the customer level) |
Changing the structure of the data file | Restructure or Transpose | Making variables into cases or cases into variables |
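To make two of the tasks in the table concrete, here is a minimal sketch written in plain Python rather than SPSS, covering a recode of a 7-point rating into three groups and an aggregation of transactional rows to the customer level. The cut points (1-3 negative, 4 neutral, 5-7 positive), the function names, and the sample data are assumptions for illustration only. In SPSS you would do the same work through the Recode and Aggregate dialogs.

```python
from collections import defaultdict


def recode_satisfaction(score):
    """Map a 1-7 rating to a three-level group (cut points are an assumption)."""
    if not 1 <= score <= 7:
        raise ValueError(f"score out of range: {score}")
    if score <= 3:
        return "negative"
    if score == 4:
        return "neutral"
    return "positive"


def aggregate_by_customer(transactions):
    """Sum purchase amounts so that each customer ends up with one row."""
    totals = defaultdict(float)
    for customer_id, amount in transactions:
        totals[customer_id] += amount
    return dict(totals)


# Hypothetical data for illustration only.
ratings = [1, 4, 7, 5, 3]
recoded = [recode_satisfaction(r) for r in ratings]

transactions = [("c1", 10.0), ("c2", 5.5), ("c1", 2.5)]
customer_totals = aggregate_by_customer(transactions)
```

Writing the recoded values to a new list instead of overwriting `ratings` mirrors the table's advice to prefer recoding into a different variable, so the original values remain available for checking.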
Effects of measurement level
The level of measurement of a variable determines the appropriate summary statistics and graphs to describe data. The following table summarizes the most common summary measures and graphs for each measurement level.
Level of Measurement
| Nominal | Ordinal | Scale |
Definition | Unordered categories | Ordered categories | Numeric values |
Examples | Gender, geographic location, job category | Satisfaction ratings, income groups, ranking of preferences | Number of purchases, cholesterol level, age |
Measures of central tendency | Mode | Median | Median or mean |
Measures of dispersion | None | Min/max/range | Min/max/range, standard deviation/variance |
Graph | Pie or bar | Bar | Histogram or box and whiskers plot |
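The mapping from measurement level to summary measures can be sketched as a small lookup. The example below uses Python's standard `statistics` module rather than SPSS. The `summarize` function and its sample data are hypothetical, and the sketch assumes that ordinal categories are stored as numeric codes.

```python
import statistics


def summarize(values, level):
    """Return the summary measures appropriate for a measurement level."""
    if level == "nominal":
        # Unordered categories: only the mode is meaningful.
        return {"mode": statistics.mode(values)}
    if level == "ordinal":
        # median_low always returns a value that occurs in the data,
        # which suits ordered category codes.
        return {
            "median": statistics.median_low(values),
            "min": min(values),
            "max": max(values),
            "range": max(values) - min(values),
        }
    if level == "scale":
        return {
            "median": statistics.median(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
            "range": max(values) - min(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
        }
    raise ValueError(f"unknown measurement level: {level}")


# Hypothetical data for illustration only.
region_summary = summarize(["east", "west", "east"], "nominal")
rating_summary = summarize([1, 2, 2, 3, 5], "ordinal")
purchase_summary = summarize([2, 4, 4, 4, 5, 5, 7, 9], "scale")
```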
The prior table showed how level of measurement determines the type of graph you can use to display individual variables. The following table shows which types of graphs are appropriate for different variable combinations.
Graphs for Variable Combinations
| Categorical Dependent | Scale Dependent |
Categorical Independent | Clustered bar or paneled pie | Error bar or boxplot |
Scale Independent | Error bar or boxplot | Scatter plot |
Reviewing the data file for the first time in SPSS Statistics
After you have your data, you are ready to start exploring it and becoming familiar with its characteristics. Start by reviewing the distribution of each variable and checking the number of valid cases.
When you have a categorical variable, it’s important to know the number of unique values and to make sure there are no more or fewer categories than expected. It’s also important to determine how the cases are distributed among the categories of the variable.
Look for categories that have either very few or very many cases. Either situation could cause problems when analyzing the data, so you may need to exclude those values or combine them with other values (but only if it makes sense) to build a valid analysis.
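The categorical checks described above can be sketched outside SPSS in a few lines of Python. The function name, the 5% and 90% thresholds, and the sample data are assumptions chosen for illustration; suitable thresholds depend on your data and on the analysis you plan to run.

```python
from collections import Counter


def category_report(values, expected_categories, min_share=0.05, max_share=0.90):
    """Count cases per category and flag categories that look suspicious."""
    counts = Counter(values)
    n = len(values)
    return {
        "counts": dict(counts),
        # Categories in the data that were not expected (often typos).
        "unexpected": sorted(set(counts) - set(expected_categories)),
        # Expected categories that never occur in the data.
        "missing": sorted(set(expected_categories) - set(counts)),
        # Categories holding very few or very many of the cases.
        "sparse": sorted(c for c, k in counts.items() if k / n < min_share),
        "dominant": sorted(c for c, k in counts.items() if k / n > max_share),
    }


# Hypothetical data: 40 cases, one of them a misspelled region.
regions = ["north"] * 25 + ["south"] * 14 + ["nroth"]
report = category_report(regions, expected_categories=["north", "south", "east"])
```

Here the misspelled value shows up both as an unexpected category and as a sparse one, which suggests correcting it or combining it with the category it belongs to.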
For continuous variables, check for unusual distributions such as bimodality or a high degree of skewness. Also look at summary statistics and note if there are any deviations from what you expect (lower minimums, higher maximums, different means, or more or less variation in the data values).
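As a rough illustration of these checks, the sketch below computes summary statistics, a simple moment-based skewness, and a list of values outside an expected range. It is plain Python rather than SPSS. The function names, the expected range, and the sample ages are assumptions, and SPSS may report a sample-adjusted skewness that differs slightly from the formula used here.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a long right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def describe(values, expected_min, expected_max):
    """Summarize a continuous variable and list values outside the expected range."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "skewness": skewness(values),
        "out_of_range": [x for x in values if not expected_min <= x <= expected_max],
    }


# Hypothetical ages; 230 is most likely a data entry error.
ages = [23, 25, 31, 34, 36, 41, 44, 52, 230]
report = describe(ages, expected_min=18, expected_max=100)
```

A maximum far above what you expect, combined with strong positive skewness, is the kind of deviation worth tracing back to the source data before any analysis.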
Finally, you can easily spot potential problems in data that otherwise appears valid by asking a series of questions:
- Does the distribution of the variable make sense?
- Is this what you were expecting?
- Do you notice any errors?
- Do you notice any unusual values?
- Will you have any potential problems when analyzing this data?