Articles From David Unger
Article / Updated 07-14-2021
When you make a decision in a hypothesis test, there’s never a 100 percent guarantee you’re right. You must be cautious of Type I errors (rejecting a true claim) and Type II errors (failing to reject a false claim). Instead, you hope that your procedures and data are good enough to properly reject a false claim. The probability of correctly rejecting $H_0$ when it is false is known as the power of the test. The larger it is, the better.

Suppose you want to calculate the power of a hypothesis test on a population mean when the standard deviation is known. Before calculating the power of a test, you need the following:

- The previously claimed value of $\mu$ in the null hypothesis, $\mu_0$
- The one-sided inequality of the alternative hypothesis (either < or >), for example, $\mu > \mu_0$
- The mean of the observed values, $\bar{x}$
- The population standard deviation, $\sigma$
- The sample size (denoted n)
- The level of significance, $\alpha$

To calculate power, you basically work two problems back-to-back. First, find a percentile assuming that $H_0$ is true. Then, turn it around and find the probability that you’d get that value assuming $H_0$ is false (and instead $H_a$ is true).

1. Assume that $H_0$ is true, and $\mu = \mu_0$.
2. Find the percentile value corresponding to $\alpha$ sitting in the tail(s) corresponding to $H_a$. That is, if $H_a: \mu > \mu_0$, then find b where $p(\bar{x} > b) = \alpha$. If $H_a: \mu < \mu_0$, then find b where $p(\bar{x} < b) = \alpha$.
3. Assume that $H_0$ is false, and instead $H_a$ is true. Since $\mu \neq \mu_0$ under this assumption, let $\mu = \bar{x}$ in the next step.
4. Find the power by calculating the probability of getting a value more extreme than b from Step 2 in the direction of $H_a$. This process is similar to finding the p-value in a test of a single population mean, but instead of using $\mu_0$ you use $\bar{x}$.

Suppose a child psychologist says that the average time that working mothers spend talking to their children is 11 minutes per day. You want to test $H_0: \mu = 11$ versus $H_a: \mu > 11$. You conduct a random sample of 100 working mothers and find they spend an average of 11.5 minutes per day talking with their children. Assume prior research suggests the population standard deviation is 2.3 minutes.

When conducting this hypothesis test for a population mean, you find that the p-value = 0.015, and with a level of significance of $\alpha = 0.05$ you reject the null hypothesis. But there are a lot of different values of $\bar{x}$ (not just 11.5) that would lead you to reject $H_0$. So how strong is this specific test? Find the power.

1. Assume that $H_0$ is true, and $\mu = 11$.
2. Find the percentile value corresponding to $\alpha = 0.05$ sitting in the upper tail. If $p(Z > z_b) = 0.05$, then $z_b = 1.645$. Further, $b = \mu_0 + z_b \frac{\sigma}{\sqrt{n}} = 11 + 1.645 \cdot \frac{2.3}{\sqrt{100}} \approx 11.38$.
3. Assume that $H_0$ is false, and instead $\mu = 11.5$.
4. Find the power by calculating the probability of getting a value more extreme than b from Step 2 in the direction of $H_a$. Here, you need to find $p(Z > z)$ where $z = \frac{b - \mu}{\sigma / \sqrt{n}} = \frac{11.38 - 11.5}{2.3 / \sqrt{100}} \approx -0.52$. Using the Z-table, you find that $p(Z > -0.52) = 0.6985$.

Hopefully, you were already feeling good about your decision to reject the null hypothesis since the p-value of 0.015 was significant at an $\alpha$ of 0.05. Further, you found that Power = 0.6985, meaning that there was nearly a 70 percent chance of correctly rejecting a false null hypothesis.

This is just one power calculation based on a single sample generating a mean of 11.5. Statisticians often calculate a “power curve” based on many likely alternative values. Also, there are some unique considerations to take into account if $H_a: \mu \neq \mu_0$, but this gives you the gist of things.
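The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article’s own code: the function names are made up for the example, and it uses the standard library’s NormalDist in place of a Z-table. Because it skips the table’s rounding of b and z, it gives a power of roughly 0.70 rather than exactly 0.6985.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Z-distribution: mean 0, standard deviation 1


def power_upper_tail(mu0, mu_alt, sigma, n, alpha):
    """Power of an upper-tailed Z-test (Ha: mu > mu0) if the true mean is mu_alt."""
    std_err = sigma / sqrt(n)
    # Step 2: cutoff b that leaves alpha in the upper tail when H0 is true.
    b = mu0 + STD_NORMAL.inv_cdf(1 - alpha) * std_err
    # Step 4: chance that the sample mean lands above b when the mean is mu_alt.
    return 1 - NormalDist(mu_alt, std_err).cdf(b)


def p_value_upper_tail(x_bar, mu0, sigma, n):
    """Upper-tail p-value for an observed sample mean x_bar."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    return 1 - STD_NORMAL.cdf(z)


# Numbers from the working-mothers example.
power = power_upper_tail(mu0=11, mu_alt=11.5, sigma=2.3, n=100, alpha=0.05)
p_value = p_value_upper_tail(x_bar=11.5, mu0=11, sigma=2.3, n=100)
print(f"p-value = {p_value:.3f}, power = {power:.3f}")

# A small power curve over a few hypothetical alternative means.
for mu_alt in (11.2, 11.4, 11.6, 11.8):
    print(mu_alt, round(power_upper_tail(11, mu_alt, 2.3, 100, 0.05), 3))
```

Looping over alternative means, as in the last lines, is one simple way to trace out the kind of power curve mentioned above.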
Article / Updated 03-26-2016
You use hypothesis tests to challenge whether some claim about a population is true (for example, a claim that 90 percent of Americans own a cellphone). To test a statistical hypothesis, you take a sample, collect data, form a statistic, standardize it to form a test statistic, and decide whether the test statistic refutes the claim. The following table lays out the important details for hypothesis tests. Note that for the tests involving the difference of two population values, it’s typical that the claimed difference in the null hypothesis is 0. You can also use these critical z*-values for hypothesis tests in which the test statistic follows a Z-distribution. If the absolute value of the test statistic is greater than the corresponding z*-value, then reject the null hypothesis.
Article / Updated 03-26-2016
Critical values (z*-values) are an important component of confidence intervals (the statistical technique for estimating population parameters). The z*-value, which appears in the margin of error formula, measures the number of standard errors to be added and subtracted in order to achieve your desired confidence level (the percentage confidence you want). The following table shows common confidence levels and their corresponding z*-values.

Confidence Level    z*-value
80%                 1.28
85%                 1.44
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.58

You can also use these critical z*-values for two-sided hypothesis tests in which the test statistic follows a Z-distribution, using a significance level equal to 100 percent minus the confidence level. If the absolute value of the test statistic is greater than the corresponding z*-value, then reject the null hypothesis.
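If you don’t have the table handy, the z*-values can be recomputed. The sketch below is only an illustration (the function name z_star is invented for it): it uses the inverse normal CDF from Python’s standard library and splits the leftover probability evenly between the two tails.

```python
from statistics import NormalDist


def z_star(confidence):
    """Critical value that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


def reject_two_sided(test_statistic, confidence):
    """Two-sided decision rule: reject H0 when |z| exceeds z*."""
    return abs(test_statistic) > z_star(confidence)


for level in (0.80, 0.85, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  {z_star(level):.3f}")
```

The printed values agree with the table once rounded to the same number of decimal places.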
Article / Updated 03-26-2016
In a nutshell, the Central Limit Theorem says you can use the normal distribution to describe the behavior of a sample mean even if the individual values that make up the sample mean are not normal themselves. But this is only possible if the sample size is “large enough.” Many statistics textbooks would tell you that n would have to be at least 30.

But why is n = 30 the benchmark? Many variables in nature, finance, and other applications have a distribution that’s very close to the normal curve. For example, by looking at the t-table, you see that the various values of t start to get really close to the values of z by the time you hit about 30 degrees of freedom. One reason for this is that the t-distributions and the normal distribution share two important characteristics: They are symmetric, and they are unimodal (having one peak).

If the distribution of your individual data values is far off from either of these qualities, you might need more than a sample size of 30 to use the Central Limit Theorem. The further away the data is from being symmetric and unimodal, the more data you’ll need.

Symmetry
If you know or suspect that your parent distribution is not symmetric about the mean, then you may need a sample size that’s significantly larger than 30 to get the possible sample means to look normal (and thus use the Central Limit Theorem). Consider the following right-skewed histogram, which records the number of pets per household. Now, suppose it represents the entire population of households. You repeatedly sample n = 30 households from that population. Here is what the distribution of possible sample means looks like. You can see that this distribution is not normal because the right tail still stretches out farther from the central peak than the left tail does. It’s not symmetric. For this population, you need to take a sample of around n = 100 to get the sample means to settle into a symmetric curve.
Unimodal
If you know or suspect that your parent distribution is not unimodal and has more than one peak, then you might need more than 30 in your sample to feel good about using the Central Limit Theorem. Consider the following multimodal population histogram with three distinct peaks. If you only sample n = 30 from that population, you do get a unimodal distribution, but it’s still not quite symmetric. For this population, you need to take a sample of at least n = 50 to feel comfortable that your sample mean distribution is roughly normal.
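A quick simulation shows why a skewed population needs a larger sample. The sketch below is a rough illustration under stated assumptions: it uses an exponential distribution as a stand-in right-skewed population (not the pets-per-household data), a fixed random seed, and a simple skewness measure where 0 means symmetric. All names are invented for the example.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Average cubed standardized deviation; 0 for a symmetric distribution."""
    center = fmean(values)
    spread = pstdev(values, mu=center)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)


def skewness_of_sample_means(n, reps, rng):
    """Draw `reps` samples of size n and measure the skewness of their means."""
    means = [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]
    return skewness(means)


rng = random.Random(42)  # fixed seed so the run is repeatable
skew_30 = skewness_of_sample_means(n=30, reps=10_000, rng=rng)
skew_100 = skewness_of_sample_means(n=100, reps=10_000, rng=rng)
print(f"skewness of sample means: n=30 -> {skew_30:.2f}, n=100 -> {skew_100:.2f}")
```

In this stand-in population the sample means are still visibly right-skewed at n = 30 and noticeably less so at n = 100. The specific cutoffs of 100 and 50 quoted above belong to the article’s own populations and are not reproduced here.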
Article / Updated 03-26-2016
The mean and median are the two most reliable and commonly reported measures of the center, and they are used in a wide variety of situations. However, if you’re seriously studying statistics, you should be familiar with two other measures of central tendency.

Mode
The mode is another measure of center that calculates which value (or range of values) occurs most frequently. The mean and median can be very effective at describing symmetric and unimodal distributions. The mode is useful for explaining situations that the mean and median cannot, particularly skewed or multimodal data.

To calculate the mode, you simply create a frequency table of all possible values and count the number of times each appears. For example, if the data set contains 10, 20, 20, 20, 30, 30, 40, 50, 50; then the mode is 20.

If you have a data set that doesn’t have values that are repeated exactly, you can split them into ranges similar to the way you prepare for making a histogram. For example, in the following table, two players on the Lakers are making the NBA league minimum, so the mode could be considered to be $959,111. Alternately, you could split the data into groups of $1 million, in which case the mode would be the range from $5–6 million because four players fall into that group.

Salaries for L.A. Lakers NBA Players (2009–2010)

Player                  Salary ($)
Kobe Bryant             23,034,375
Pau Gasol               16,452,000
Andrew Bynum            12,526,998
Lamar Odom               7,500,000
Ron Artest               5,854,000
Adam Morrison            5,257,229
Derek Fisher             5,048,000
Sasha Vujacic            5,000,000
Luke Walton              4,840,000
Shannon Brown            2,000,000
Jordan Farmar            1,947,240
Didier Ilunga-Mbenga       959,111
Josh Powell                959,111
Total                   91,378,064

The mode can be visualized by the peak in a histogram. With data sets that have multiple peaks, it’s not uncommon to report multiple modes because the mean and the median may not accurately reflect where most of the values lie.
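Both ways of finding a mode can be sketched in Python. This is one possible approach, using the standard library’s frequency helpers. The $1 million bins are formed by integer division, which is an implementation choice for this example.

```python
from collections import Counter
from statistics import mode, multimode

# Exact-value mode for the small data set in the text.
data = [10, 20, 20, 20, 30, 30, 40, 50, 50]
exact_mode = mode(data)

# Lakers salaries from the table (in dollars).
salaries = [
    23_034_375, 16_452_000, 12_526_998, 7_500_000, 5_854_000,
    5_257_229, 5_048_000, 5_000_000, 4_840_000, 2_000_000,
    1_947_240, 959_111, 959_111,
]
repeated_values = multimode(salaries)  # values tied for the highest count

# Grouped mode: count players per $1 million range (bin 5 means $5-6 million).
bins = Counter(salary // 1_000_000 for salary in salaries)
top_bin, players_in_top_bin = bins.most_common(1)[0]

print(exact_mode, repeated_values, top_bin, players_in_top_bin)
```

The two answers differ because the grouping changes what counts as “the same value,” which is why it helps to state the bin width whenever you report a grouped mode.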
Trimmed mean
You’ve seen that the mean is susceptible to outliers and will be “pulled” toward the most extreme values. The trimmed mean (or truncated mean) tries to eliminate the influence of outliers by trimming off a small number of extreme values so the mean focuses more on the most central values.

To calculate a trimmed mean, you choose a small percentage of your data set (say, 10 percent), split that number in half, remove the corresponding percentage of values from both the low and high ends, and then calculate the mean of the remaining values.

For example, suppose a data set contains the following n = 20 values: 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 9, 500. The outlier value of 500 drives the (traditional) sample mean to be 29.6, which is larger than all but one of the data values and not really indicative of where all the action is. Instead, you can cut out the most extreme 10 percent, which means removing two values (10% x 20 = 2), and just calculate a mean based on the middle 90 percent of values. Since you have to split that two between the two ends, you’ll remove one from the low end (3) and one from the high end (500). The 90 percent trimmed mean based on the remaining 18 data values is 4.9 and better reflects the central trend of the data.
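The trimming procedure can be sketched as a short function. This is a minimal illustration (the name trimmed_mean and its trim_fraction parameter are made up here). It assumes the total fraction to trim is split evenly between the two ends and rounded to a whole number of values.

```python
from statistics import fmean


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values, half from each end."""
    ordered = sorted(values)
    per_end = round(len(ordered) * trim_fraction / 2)  # values cut from each end
    return fmean(ordered[per_end:len(ordered) - per_end])


data = [3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 9, 500]
plain = fmean(data)
trimmed = trimmed_mean(data, trim_fraction=0.10)
print(f"mean = {plain:.1f}, trimmed mean = {trimmed:.1f}")
```

With the 20 values from the example, trimming 10 percent removes one 3 and the 500, leaving 18 values.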
Article / Updated 03-26-2016
Simple random samples are the best way to get an unbiased, representative selection of individuals to be a part of a study. However, the process for generating a simple random sample is akin to having everyone’s name in a hat and then pulling out slips of paper until you fill your sample. How often do you really have access to the names of everyone in a population? It’s hard enough to get a list of all people in a population. How much more difficult is it for sampling wild animals or crops on a farm?

In many practical situations, simple random sampling isn’t feasible, isn’t cost effective, and is sometimes downright impossible. Other situations call for a more sophisticated method of sampling. Here are two other methods statisticians use for generating useful samples.

Multistage sampling
Multistage sampling is used when simple random sampling is impractical. This method breaks the whole population down into subgroups using simple random sampling at each stage. For example, to poll the residents of a state on an upcoming election, you could begin by randomly selecting half of the counties in the state. Then randomly select five of the zip codes within each selected county. Finally, randomly select ten addresses within each of the selected zip codes. With this method, you don’t need all the household addresses up front as you would with a simple random sample.

Additionally, multistage sampling can be an effective way to save time and resources. Some data collection requires the researcher to be there in person with the subjects, such as medical studies. You wouldn’t necessarily need to travel all over the place in a scattered manner based on completely random sampling. Instead, you would just go to the selected areas given by the multistage sampling.

Stratified sampling
A stratified random sample is a slightly more complex form of random sampling. Stratified sampling is essentially a series of simple random samples on subgroups of a given population.
The subgroups are chosen because they are somehow important to the study or experiment. For example, if you’re conducting a study on the differences between how men and women would react to a certain news story, a simple random sample could result in there being too many of one group and not enough (or none at all!) of the other. With a stratified sample, you could dictate that you want a specific number of both men and women in the study. You could randomly sample the same number from within each subgroup in a stratified sample, but in certain situations, you may want to use unequally sized samples.
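Both sampling schemes can be sketched with Python’s random module. Everything below is hypothetical illustration: the toy “state” of 8 counties, the made-up population of 1,000 people, and the function names are assumptions, and a real survey would draw from actual sampling frames.

```python
import random


def multistage_sample(state, rng, zips_per_county=5, addresses_per_zip=10):
    """state maps county -> zip code -> list of addresses."""
    chosen = []
    counties = rng.sample(sorted(state), k=len(state) // 2)  # half the counties
    for county in counties:
        zip_codes = rng.sample(sorted(state[county]), k=zips_per_county)
        for zip_code in zip_codes:
            chosen.extend(rng.sample(state[county][zip_code], k=addresses_per_zip))
    return chosen


def stratified_sample(population, stratum_of, sizes, rng):
    """Simple random sample of sizes[name] members from each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    return {name: rng.sample(strata[name], k=k) for name, k in sizes.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy state: 8 counties x 6 zip codes x 20 addresses (all made up).
state = {
    f"county{c}": {
        f"zip{c}-{z}": [f"address {c}-{z}-{a}" for a in range(20)]
        for z in range(6)
    }
    for c in range(8)
}
addresses = multistage_sample(state, rng)

# Toy population in which only 30 percent are women.
people = [(i, "woman" if i % 10 < 3 else "man") for i in range(1000)]
groups = stratified_sample(people, lambda p: p[1], {"man": 25, "woman": 25}, rng)
print(len(addresses), {name: len(members) for name, members in groups.items()})
```

Passing different numbers in the sizes dictionary gives the unequally sized strata mentioned above.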
Article / Updated 03-26-2016
Formulas: you just can’t get away from them when you’re studying statistics. Here are ten statistical formulas you’ll use frequently and the steps for calculating them.

Proportion
Some variables are categorical and identify which category or group an individual belongs to. For example, “relationship status” is a categorical variable, and an individual could be single, dating, married, divorced, and so on. The actual number of individuals in any given category is called the frequency for that category. A proportion, or relative frequency, represents the percentage of individuals that falls into each category. The proportion of a given category, denoted by p, is the frequency divided by the total sample size. So to calculate the proportion, you

1. Count up all the individuals in the sample who fall into the specified category.
2. Divide by n, the number of individuals in the sample.

Mean
The mean, or the average of a data set, is one way to measure the center of a numerical data set. The notation for the mean is $\bar{x}$. The formula for the mean is $\bar{x} = \frac{\sum x}{n}$, where x represents each of the values in the data set. To calculate the mean, you

1. Add up all the numbers in the data set.
2. Divide by n, the number of values in the data set.

Median
The median of a numerical data set is another way to measure the center. The median is the middle value after you order the data from smallest to largest. To calculate the median, go through the following steps:

1. Order the numbers from smallest to largest.
2. For an odd amount of numbers, choose the one that falls exactly in the middle. You’ve pinpointed the median.
3. For an even amount of numbers, take the two numbers exactly in the middle and average them to find the median.

Sample standard deviation
The standard deviation of a sample is a measure of the amount of variability in the sample. You can think of it, in general terms, as the average distance from the mean.
The formula for the standard deviation is $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$. To calculate the standard deviation, you

1. Find the average of all the numbers, $\bar{x}$.
2. Take each number and subtract the average from it.
3. Square each of the resulting values.
4. Add them all up.
5. Divide by n – 1.
6. Take the square root.

Percentile
Percentiles are a way to determine an individual value relative to all the other values in a data set. When taking a standardized test, you get an individual raw score and a percentile. If you come in at the 90th percentile, for example, 90 percent of the test scores of all students are the same as or below yours (and 10 percent are above yours). In general, being at the kth percentile means k percent of the data lie at or below that point and (100 – k) percent lie above it. To calculate a percentile for values that follow a normal distribution, you

1. Convert the original value to a standard score by using the z-formula, $z = \frac{x - \mu}{\sigma}$, where x is the original value, $\mu$ is the population mean of all values, and $\sigma$ is the population standard deviation of all values.
2. Use the Z-table to find the corresponding percentile for the standard score.

Margin of error for the sample mean
The margin of error for your sample mean, $\bar{x}$, is the amount you expect the sample mean to vary from sample to sample. The formula for the margin of error for $\bar{x}$, dealing with samples of size 30 or more, is $MOE = z^* \frac{\sigma}{\sqrt{n}}$, where z* is the standard normal value for the confidence level you want. To calculate the margin of error for $\bar{x}$, you

1. Determine the confidence level and find the appropriate z*.
2. Find the standard deviation, $\sigma$, and the sample size, n.
3. Multiply z* by $\sigma$ divided by the square root of n.

Sample size needed
If you want to calculate a confidence interval for the population mean with a certain margin of error, you can figure out the sample size you need before you collect any data. The formula for the sample size for estimating $\mu$ is $n = \left(\frac{z^* \sigma}{MOE}\right)^2$, where z* is the standard normal value for the confidence level, MOE is your desired margin of error, and $\sigma$ is the standard deviation.
Because $\sigma$ is an unknown value that you need, you may have to do a pilot study (small experimental study) to come up with a guess, s, for the value of the standard deviation. To calculate the sample size for estimating $\mu$, run through the following steps:

1. Multiply z* times s.
2. Divide by the desired margin of error, MOE.
3. Square it.
4. Round any fractional amount up to the nearest integer (so you achieve your desired MOE or better).

Test statistic for the mean
When conducting a hypothesis test for the population mean, you take the sample mean and find out how far it is from the claimed value in terms of a standard score. The standard score is called the test statistic. The formula for the test statistic for the mean is $z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the claimed value for the population mean (the value that sits in the null hypothesis). To calculate the test statistic for the sample mean for samples of size 30 or more, you

1. Calculate the sample mean, $\bar{x}$, and the sample standard deviation, s.
2. Take $\bar{x} - \mu_0$.
3. Calculate the standard error, $\frac{s}{\sqrt{n}}$.
4. Divide your result from Step 2 by the standard error found in Step 3.

Correlation
Sample correlation is a measure of the strength and direction of the linear relationship between two quantitative variables X and Y. It doesn’t measure any other type of relationship, and it doesn’t apply to categorical variables. The formula for correlation is $r = \frac{1}{n - 1} \sum \frac{(x - \bar{x})(y - \bar{y})}{s_x s_y}$. To calculate the correlation, you

1. Find the mean of all the x values and call it $\bar{x}$. Find the mean of all the y values and call it $\bar{y}$.
2. Find the standard deviation of all the x values and call it $s_x$. Find the standard deviation of all the y values and call it $s_y$.
3. For each (x, y) pair in the data set, take x minus $\bar{x}$ and y minus $\bar{y}$ and multiply them together.
4. Add all these products together to get a sum.
5. Divide the sum by $s_x \times s_y$.
6. Divide the result by n – 1, where n is the number of (x, y) pairs. (This is the same as multiplying by one over n – 1.)
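Several of these step lists translate directly into code. The sketch below follows the steps for the sample standard deviation, the percentile, the test statistic, and the correlation. The function names and the small data sets are invented for illustration, and the percentile helper assumes normally distributed values, using NormalDist where the text uses a Z-table.

```python
from math import sqrt
from statistics import NormalDist, fmean


def sample_sd(xs):
    """Square root of the summed squared deviations divided by n - 1."""
    x_bar = fmean(xs)
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))


def percentile_of(x, mu, sigma):
    """Percent of a normal population at or below x."""
    z = (x - mu) / sigma
    return 100 * NormalDist().cdf(z)


def z_test_statistic(x_bar, mu0, s, n):
    """Distance of the sample mean from the claimed mean, in standard errors."""
    return (x_bar - mu0) / (s / sqrt(n))


def correlation(xs, ys):
    """Sample correlation r, following the six steps in order."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (s_x * s_y) / (len(xs) - 1)


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
pct = percentile_of(x=650, mu=500, sigma=100)
z = z_test_statistic(x_bar=11.5, mu0=11, s=2.3, n=100)
print(f"s_x = {sample_sd(xs):.4f}, r = {r:.4f}, percentile = {pct:.2f}, z = {z:.2f}")
```

Keeping each formula as its own small function mirrors the way the steps build on one another: the correlation reuses the standard deviation, which in turn reuses the mean.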
Regression line
After examining a scatterplot between two numerical variables and calculating the sample correlation between the two variables, you might observe a linear relationship between them. In that case, it would be appropriate to estimate a regression line for estimating the value of the response variable (Y) given a value for the explanatory variable (X). Before calculating the regression line, you need five summary statistics:

- The mean of the x values, $\bar{x}$
- The mean of the y values, $\bar{y}$
- The standard deviation of the x values (denoted $s_x$)
- The standard deviation of the y values (denoted $s_y$)
- The correlation between X and Y (denoted r)

So, to calculate the best-fit regression line, you

1. Find the slope using the formula $m = r \frac{s_y}{s_x}$.
2. Find the y-intercept using the formula $b = \bar{y} - m \bar{x}$.
3. Piece together the results from Steps 1 and 2 to give you the regression line: y = mx + b.
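The regression steps can be sketched from the same five summary statistics. This is a minimal illustration with made-up data. The function name regression_line is an assumption, and it relies on the standard library’s stdev (which divides by n – 1) for the two standard deviations.

```python
from statistics import fmean, stdev


def regression_line(xs, ys):
    """Return (slope, intercept) built from the five summary statistics."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation r: summed products of deviations, scaled by s_x, s_y and n - 1.
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = total / (s_x * s_y) / (len(xs) - 1)
    slope = r * s_y / s_x              # Step 1
    intercept = y_bar - slope * x_bar  # Step 2
    return slope, intercept


# Made-up data for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = regression_line(xs, ys)
print(f"y = {m:.1f}x + {b:.1f}")
```

To estimate Y for a new X, plug the value into the line: for example, m * 6 + b for x = 6.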
Article / Updated 03-26-2016
In statistics, a confidence interval gives a range of plausible values for some unknown population characteristic. It contains an initial estimate plus or minus a margin of error (the amount by which you expect your results to vary if other samples were taken). The following table shows formulas for the components of the most common confidence intervals and keys for when to use them.
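As one concrete case of “estimate plus or minus margin of error,” here is a hedged sketch of a confidence interval for a population mean when the population standard deviation is known. The function name is made up, and the inputs (a sample mean of 11.5, a standard deviation of 2.3, and n = 100) are borrowed purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Sample mean plus or minus z* standard errors."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin_of_error = z_star * sigma / sqrt(n)
    return x_bar - margin_of_error, x_bar + margin_of_error


low, high = z_confidence_interval(x_bar=11.5, sigma=2.3, n=100, confidence=0.95)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```

Other intervals follow the same pattern with a different estimate and a different standard error.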
Article / Updated 03-26-2016
After data has been collected, the first step in analyzing it is to crunch some descriptive statistics to get an initial feeling for the data. For example: Where is the center of the data located? How spread out are the data? How correlated are the data from two variables? The most common descriptive statistics are in the following table, along with their formulas and a short description of what each one measures.
Article / Updated 03-26-2016
When designing a study, the sample size is an important consideration because the larger the sample size, the more data you have and the more precise your results will be (assuming high-quality data). If you know the level of precision you want (that is, your desired margin of error), you can calculate the sample size needed to achieve it. To find the sample size needed to estimate a population mean, $\mu$, or a population proportion (p), use the following formula: $n = \left(\frac{z^* \sigma}{MOE}\right)^2$, where z* is the critical value for the confidence level you need; MOE represents the desired margin of error; and $\sigma$ represents the population standard deviation. If $\sigma$ is unknown:

- When looking for $\mu$, estimate $\sigma$ with the sample standard deviation, s, from a pilot study.
- When looking for p, estimate $\sigma$ with $\sqrt{p_0(1 - p_0)}$, where $p_0$ is some initial guess (usually 0.50) at p.
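The sample size formula can be sketched for both cases. The function names and the example inputs below are assumptions chosen for illustration, and the result is rounded up so the achieved margin of error is at least as small as the one requested.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_star(confidence):
    """Critical value for a two-sided confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def sample_size_for_mean(sigma, moe, confidence):
    """n = (z* x sigma / MOE)^2, rounded up to a whole number."""
    return ceil((z_star(confidence) * sigma / moe) ** 2)


def sample_size_for_proportion(moe, confidence, p0=0.5):
    """Same formula, with sigma estimated by sqrt(p0 x (1 - p0))."""
    return sample_size_for_mean(sqrt(p0 * (1 - p0)), moe, confidence)


# Hypothetical targets: a mean within 0.5 units, a proportion within 3 points.
n_mean = sample_size_for_mean(sigma=2.3, moe=0.5, confidence=0.95)
n_prop = sample_size_for_proportion(moe=0.03, confidence=0.95)
print(n_mean, n_prop)
```

Using 0.50 as the initial guess is the cautious choice because it makes the estimated standard deviation as large as it can be, so the resulting sample size is never too small for the proportion you end up observing.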