Articles From Alan Anderson
Cheat Sheet / Updated 12-21-2023
Statistics make it possible to analyze real-world business problems with actual data so that you can determine if a marketing strategy is really working, how much a company should charge for its products, or any of a million other practical questions. The science of statistics uses regression analysis, hypothesis testing, sampling distributions, and more to ensure accurate data analysis.
Article / Updated 07-10-2023
You can use the Central Limit Theorem to convert a sampling distribution to a standard normal random variable. Based on the Central Limit Theorem, if you draw samples of size 30 or more, then the sample mean is (approximately) a normally distributed random variable. To determine probabilities for the sample mean from the standard normal tables, you must first convert the sample mean to a standard normal random variable.

The standard normal distribution is the special case where the mean equals 0 and the standard deviation equals 1. For any normally distributed random variable X with mean $\mu$ and standard deviation $\sigma$, you find the corresponding standard normal random variable (Z) with the following equation:

$$Z = \frac{X - \mu}{\sigma}$$

For the sampling distribution of the sample mean $\bar{X}$, the corresponding equation is

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

As an example, say that there are 10,000 stocks trading each day on a regional stock exchange. It's known from historical experience that the returns to these stocks have a mean value of 10 percent per year and a standard deviation of 20 percent per year. An investor chooses to buy a random selection of 100 of these stocks for his portfolio. What's the probability that the mean rate of return among these 100 stocks is greater than 8 percent?

The investor's portfolio can be thought of as a sample of stocks chosen from the population of stocks trading on the regional exchange. The first step to finding this probability is to compute the moments of the sampling distribution.

Compute the mean: The mean of the sampling distribution equals the population mean:

$$\mu_{\bar{X}} = \mu = 10\ \text{percent}$$

Determine the standard error: This calculation is a little trickier because the standard error depends on the size of the sample relative to the size of the population. In this case, the sample size (n) is 100, while the population size (N) is 10,000. So you first compute the sample size relative to the population size, like so:

$$\frac{n}{N} = \frac{100}{10{,}000} = 0.01 = 1\ \text{percent}$$

Because 1 percent is less than 5 percent, you don't use the finite population correction factor to compute the standard error. Note that in this case, the value of the finite population correction factor is:

$$\sqrt{\frac{N - n}{N - 1}} = \sqrt{\frac{10{,}000 - 100}{10{,}000 - 1}} \approx 0.995$$

Because this value is so close to 1, using the finite population correction factor would have little or no impact on the resulting probabilities. And because the finite population correction factor isn't needed in this case, the standard error is computed as follows:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{20}{\sqrt{100}} = 2\ \text{percent}$$

To determine the probability that the sample mean is greater than 8 percent, you must now convert the sample mean into a standard normal random variable using the following equation:

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$$

Substituting the values from the example gives:

$$P(\bar{X} > 8) = P\!\left(Z > \frac{8 - 10}{2}\right) = P(Z > -1)$$

You can calculate this probability by using the properties of the standard normal distribution along with a standard normal table such as this one.

Standard Normal Table: Negative Values

Z      0.00    0.01    0.02    0.03
–1.3   0.0968  0.0951  0.0934  0.0918
–1.2   0.1151  0.1131  0.1112  0.1093
–1.1   0.1357  0.1335  0.1314  0.1292
–1.0   0.1587  0.1562  0.1539  0.1515

The table shows the probability that a standard normal random variable (designated Z) is less than or equal to a specific value. For example, you can write the probability that Z is less than or equal to –1.0 (one standard deviation below the mean) as $P(Z \le -1.0)$. You find the probability from the table with these steps:

1. Locate the first digit before and after the decimal point (–1.0) in the first (Z) column.
2. Find the second digit after the decimal point (0.00) in the second (0.00) column.
3. See where the row and column intersect to find the probability: $P(Z \le -1.0) = 0.1587$.

Because you're actually looking for the probability that Z is greater than or equal to –1, one more step is required. Due to the symmetry of the standard normal distribution, the probability that Z is greater than or equal to a negative value equals one minus the probability that Z is less than or equal to the same negative value. For example,

$$P(Z \ge -2) = 1 - P(Z \le -2)$$

This is because $Z \ge -2$ and $Z \le -2$ are complementary events: Z must either be greater than or equal to –2 or less than or equal to –2. Therefore,

$$P(Z \ge -2) + P(Z \le -2) = 1$$

This is true because the occurrence of one of these events is certain, and the probability of a certain event is 1. After algebraically rewriting this equation, you end up with the following result:

$$P(Z \ge -2) = 1 - P(Z \le -2)$$

For the portfolio example,

$$P(Z \ge -1) = 1 - P(Z \le -1) = 1 - 0.1587 = 0.8413$$

The result shows that there's an 84.13 percent chance that the investor's portfolio will have a mean return greater than 8 percent.
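The same calculation can be sketched in a few lines of Python. This is a minimal sketch, assuming the SciPy library is available; the scipy.stats.norm call stands in for the printed table, and the numbers are the ones from the example above.

```python
# Portfolio example: probability that the sample mean return exceeds 8 percent.
from math import sqrt
from scipy.stats import norm

mu = 10.0        # population mean return (percent per year)
sigma = 20.0     # population standard deviation (percent per year)
n, N = 100, 10_000

# The sample is only 1% of the population, so the finite population
# correction factor is skipped.
std_error = sigma / sqrt(n)          # 20 / 10 = 2

z = (8.0 - mu) / std_error           # (8 - 10) / 2 = -1.0
prob = 1 - norm.cdf(z)               # P(Z >= -1) = 1 - P(Z <= -1)
print(round(prob, 4))                # 0.8413
```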
Article / Updated 05-03-2023
After you estimate the population regression line, you can check whether the regression equation makes sense by using the coefficient of determination, also known as R² (R squared). This is used as a measure of how well the regression equation actually describes the relationship between the dependent variable (Y) and the independent variable (X). It may be the case that there is no real relationship between the dependent and independent variables; simple regression generates results even if this is the case. It is, therefore, important to subject the regression results to some key tests that enable you to determine whether the results are reliable.

The coefficient of determination, R², is a statistical measure that shows the proportion of variation explained by the estimated regression line. Variation refers to the sum of the squared differences between the values of Y and the mean value of Y, expressed mathematically as

$$\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

R² always takes on a value between 0 and 1. The closer R² is to 1, the better the estimated regression equation fits or explains the relationship between X and Y.

The expression $\sum(Y_i - \bar{Y})^2$ is also known as the total sum of squares (TSS). This sum can be divided into the following two categories:

- Explained sum of squares (ESS): Also known as the explained variation, the ESS is the portion of total variation that measures how well the regression equation explains the relationship between X and Y. You compute the ESS with the formula $ESS = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2$.
- Residual sum of squares (RSS): Also known as the unexplained variation, the RSS is the portion of total variation that measures discrepancies (errors) between the actual values of Y and those estimated by the regression equation. You compute the RSS with the formula $RSS = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2$.

The smaller the value of RSS relative to ESS, the better the regression line fits or explains the relationship between the dependent and independent variable. The sum of RSS and ESS equals the total sum of squares (TSS):

$$TSS = ESS + RSS$$

R² is the ratio of the explained sum of squares (ESS) to the total sum of squares (TSS):

$$R^2 = \frac{ESS}{TSS}$$

You can also use this formula:

$$R^2 = 1 - \frac{RSS}{TSS}$$

Based on the definition of R², its value can never be negative. Also, R² can't be greater than 1, so $0 \le R^2 \le 1$. With simple regression analysis, R² equals the square of the correlation between X and Y.

The coefficient of determination is used as a measure of how well a regression line explains the relationship between a dependent variable (Y) and an independent variable (X). The closer the coefficient of determination is to 1, the more closely the regression line fits the sample data. The coefficient of determination is computed from the sums of squares:

- To compute ESS, you subtract the mean value of Y from each of the estimated values of Y; each term is squared and then added together: ESS = 0.54.
- To compute RSS, you subtract the estimated value of Y from each of the actual values of Y; each term is squared and then added together: RSS = 0.14.
- To compute TSS, you subtract the mean value of Y from each of the actual values of Y; each term is squared and then added together. Alternatively, you can simply add ESS and RSS to obtain TSS: TSS = ESS + RSS = 0.54 + 0.14 = 0.68.

The coefficient of determination (R²) is the ratio of ESS to TSS:

$$R^2 = \frac{ESS}{TSS} = \frac{0.54}{0.68} \approx 0.7941$$

This shows that 79.41 percent of the variation in Y is explained by variation in X. Because the coefficient of determination can't exceed 100 percent, a value of 79.41 percent indicates that the regression line closely matches the actual sample data.
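Here is a minimal sketch of these sums of squares in Python, assuming NumPy is available and that the fitted values come from an ordinary least squares line with an intercept (so that TSS = ESS + RSS holds exactly). The sample values are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

def coefficient_of_determination(y, y_hat):
    """R^2 computed from the sums of squares."""
    y_bar = y.mean()
    ess = np.sum((y_hat - y_bar) ** 2)   # explained sum of squares
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares
    tss = np.sum((y - y_bar) ** 2)       # total sum of squares
    return ess / tss                     # equivalently 1 - rss / tss for OLS

# Hypothetical sample, for illustration only:
x = np.arange(5)
y = np.array([2.0, 2.5, 3.1, 3.4, 4.2])
slope, intercept = np.polyfit(x, y, 1)   # simple regression fit
y_hat = intercept + slope * x
print(round(coefficient_of_determination(y, y_hat), 4))
```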
Cheat Sheet / Updated 03-10-2022
Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of dispersion, and measures of association.
Article / Updated 03-26-2016
Regression analysis is a statistical tool used for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another, such as the effect of a price increase upon demand or the effect of changes in the money supply upon the inflation rate. Regression analysis is used to estimate the strength and the direction of the relationship between two linearly related variables: X and Y. X is the "independent" variable and Y is the "dependent" variable.

The two basic types of regression analysis are:

- Simple regression analysis: Used to estimate the relationship between a dependent variable and a single independent variable; for example, the relationship between crop yields and rainfall.
- Multiple regression analysis: Used to estimate the relationship between a dependent variable and two or more independent variables; for example, the relationship between the salaries of employees and their experience and education. Multiple regression analysis introduces several additional complexities but may produce more realistic results than simple regression analysis.

Regression analysis is based on several strong assumptions about the variables that are being estimated. Several key tests, including hypothesis tests, are used to ensure that the results are valid. These tests are used to ensure that the regression results are not simply due to random chance but indicate an actual relationship between two or more variables.

An estimated regression equation may be used for a wide variety of business applications, such as:

- Measuring the impact on a corporation's profits of an increase in prices
- Understanding how sensitive a corporation's sales are to changes in advertising expenditures
- Seeing how a stock price is affected by changes in interest rates

Regression analysis may also be used for forecasting purposes; for example, a regression equation may be used to forecast the future demand for a company's products. Because of the complexity of the calculations, regression analysis is often carried out with specialized calculators or spreadsheet programs, although a simple regression fit takes only a few lines of code, as shown below.
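Here is a minimal sketch of a simple regression fit in Python; the rainfall and crop-yield numbers are hypothetical and serve only to illustrate the mechanics.

```python
import numpy as np

rainfall = np.array([10.0, 14.0, 18.0, 22.0, 26.0, 30.0])    # X, independent
crop_yield = np.array([40.0, 48.0, 55.0, 61.0, 66.0, 72.0])  # Y, dependent

# Fit the line Y = intercept + slope * X by least squares.
slope, intercept = np.polyfit(rainfall, crop_yield, 1)
print(f"estimated line: Y = {intercept:.2f} + {slope:.2f} X")

# Multiple regression adds more independent variables; one common route is
# numpy.linalg.lstsq on a design matrix that includes a column of ones.
```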
Article / Updated 03-26-2016
In statistics, hypothesis testing refers to the process of choosing between competing hypotheses about a probability distribution, based on observed data from the distribution. It's a core topic and a fundamental part of the language of statistics.

Hypothesis testing is a six-step procedure:

1. Null hypothesis
2. Alternative hypothesis
3. Level of significance
4. Test statistic
5. Critical value(s)
6. Decision rule

The null hypothesis is a statement that's assumed to be true unless there's strong contradictory evidence. The alternative hypothesis is a statement that will be accepted in its place if the null hypothesis is rejected. The level of significance is chosen to control the probability of a "Type I" error; this is the error that results when the null hypothesis is erroneously rejected. The test statistic and critical values are used to determine whether the null hypothesis should be rejected. The decision rule is that an "extreme" test statistic results in rejection of the null hypothesis; here, an extreme test statistic is one that lies outside the bounds of the critical value or values.

Hypotheses are often tested about the values of population measures such as the mean and the variance. They are also used to determine whether a population follows a specified probability distribution, and they form a major part of regression analysis, where hypothesis tests are used to validate the results of an estimated regression equation.
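As an illustration, the six steps might be laid out in code as follows. This sketch assumes a two-tailed z-test of a population mean with a known population standard deviation, and the sample numbers are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

mu_0 = 50.0              # step 1: null hypothesis        H0: mu = 50
                         # step 2: alternative hypothesis H1: mu != 50
alpha = 0.05             # step 3: level of significance
x_bar, sigma, n = 52.3, 8.0, 64

z_stat = (x_bar - mu_0) / (sigma / sqrt(n))   # step 4: test statistic
z_crit = norm.ppf(1 - alpha / 2)              # step 5: critical values (+/-)
reject = abs(z_stat) > z_crit                 # step 6: decision rule

print(round(z_stat, 2), round(z_crit, 2),
      "reject H0" if reject else "fail to reject H0")
```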
Article / Updated 03-26-2016
Random variables and probability distributions are two of the most important concepts in statistics. A random variable assigns unique numerical values to the outcomes of a random experiment; this is a process that generates uncertain outcomes. A probability distribution assigns probabilities to each possible value of a random variable.

The two basic types of probability distributions are discrete and continuous. A discrete probability distribution can assume only a finite or countably infinite number of distinct values. Examples of discrete distributions include:

- Binomial
- Geometric
- Poisson

A continuous probability distribution can assume any value within one or more intervals, so it takes on an infinite number of possible values. Examples of continuous distributions include:

- Uniform
- Normal
- Student's t
- Chi-square
- F
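A short sketch, assuming scipy.stats is available, shows the practical difference: a discrete distribution has a probability mass function evaluated at individual values, while a continuous one has a density and a cumulative distribution function over a range. The parameter values below are arbitrary illustrations.

```python
from scipy.stats import binom, norm

# Discrete: probability of exactly 3 successes in 10 trials with p = 0.4
print(binom.pmf(3, n=10, p=0.4))

# Continuous: probability that a standard normal value falls below 1.5
print(norm.cdf(1.5))
```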
Article / Updated 03-26-2016
When you're working with populations and samples (a subset of a population) in business statistics, you can use three common types of measures to describe the data set: central tendency, dispersion, and association. By convention, the statistical formulas used to describe population measures contain Greek letters, while the formulas used to describe sample measures contain Latin letters.

Measures of central tendency

In statistics, the mean, median, and mode are known as measures of central tendency; they are used to identify the center of a data set:

- Mean: The average of the values in a data set, obtained by adding the values and dividing by the number of values
- Median: The value that divides a data set into two equal halves
- Mode: The most commonly observed value in a data set

Samples are randomly chosen from populations. If this process is carried out correctly, each sample should accurately reflect the characteristics of the population, so a sample measure, such as the mean, should be a good estimate of the corresponding population measure. Consider the following examples of the mean:

Population mean:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$$

This formula simply tells you to add up all the elements in the population and divide by the size of the population.

Sample mean:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The process for computing this is exactly the same; you add up all the elements in the sample and divide by the size of the sample.

In addition to measures of central tendency, two other key types of measures are measures of dispersion (spread) and measures of association.

Measures of dispersion

Measures of dispersion include the variance/standard deviation and percentiles/quartiles/interquartile range. The variance and standard deviation are closely related; the standard deviation always equals the square root of the variance. The formulas for the population and sample variance are:

Population variance:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$$

Sample variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$

Percentiles split up a data set into 100 equal parts, each consisting of 1 percent of the values in the data set. Quartiles are a special type of percentile; they split up the data into four equal parts. The interquartile range represents the middle 50 percent of the data; it's calculated as the third quartile minus the first quartile.

Measures of association

Another type of measure, known as a measure of association, refers to the relationship between two samples or two populations. Two examples of this are the covariance and the correlation:

Population covariance:

$$\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_X)(Y_i - \mu_Y)$$

Sample covariance:

$$s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$

Population correlation:

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

Sample correlation:

$$r_{XY} = \frac{s_{XY}}{s_X s_Y}$$

The correlation is closely related to the covariance; it's scaled so that its value always lies between negative one and positive one.
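These sample measures can be checked quickly with NumPy. This is a minimal sketch using a small hypothetical data set purely for illustration.

```python
import numpy as np

x = np.array([4.0, 7.0, 7.0, 9.0, 12.0, 15.0])
y = np.array([2.0, 5.0, 6.0, 6.0, 10.0, 13.0])

print(x.mean())                       # sample mean
print(np.median(x))                   # median
print(np.var(x, ddof=1))              # sample variance (n - 1 denominator)
print(np.std(x, ddof=1))              # sample standard deviation
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                        # interquartile range
print(np.cov(x, y, ddof=1)[0, 1])     # sample covariance
print(np.corrcoef(x, y)[0, 1])        # sample correlation
```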
Article / Updated 03-26-2016
In the field of risk management, you can measure the risk of a portfolio with the Value at Risk (VaR) methodology. The standard VaR model (known as the variance-covariance approach) is typically based on the assumption that the returns to a portfolio follow the normal distribution. But assuming that financial returns are normal when they're not can have adverse consequences. Because financial returns tend to exhibit "fat tails" in practice, the standard VaR model understates the true probability of large losses. (It also understates the true probability of large gains, but this is less important because risk management is focused on avoiding losses.) The potential to miss large losses due to an incorrect assumption of normality is one of the major weaknesses of the traditional VaR model.

Reliance on faulty models has the potential to lead to disastrous results. One perfect example of this outcome is the hedge fund Long-Term Capital Management (LTCM). LTCM failed spectacularly in 1998, requiring a $3.6 billion bailout organized by the Federal Reserve to prevent a full-scale global financial panic. LTCM counted two winners of the Nobel Prize in Economics, Robert Merton and Myron Scholes, among its partners. Market participants lined up to invest with LTCM; surely, Nobel laureates could do no wrong!

The fund sought out investments that exploited small price discrepancies between similar assets trading in different markets: assets that were seen as relatively underpriced were bought, and assets that were seen as relatively overpriced were sold. These investments weren't very risky, but the potential return was also quite low. To increase the potential profits from these trades, LTCM used enormous amounts of borrowed money to finance them. As a result, the potential return increased dramatically, but so did the risk.

The fund was quite successful for its first four years, 1994 through 1997. But 1998 was a very different matter. In August 1998, the Russian government triggered a global crisis by devaluing its currency (the ruble) and defaulting on its debt, which led to huge losses for LTCM's positions. To make things worse, LTCM found itself unable to unwind its positions or raise new capital amid the ongoing panic. As a result, the Federal Reserve engineered a bailout of the fund to prevent damage to the rest of the financial system.

The use of Value at Risk by LTCM to measure the risk of its positions has been cited as one of the reasons for the fund's downfall. The likelihood of such spectacular losses was assumed to be minuscule, far smaller than it actually was. When large losses started piling up, it was too late to change the fund's strategies.

Due to the shortcomings of the traditional Value at Risk methodology, other distributional assumptions have been tried, such as the Student's t-distribution. Another approach, known as Extreme Value Theory (EVT), models only the left tail of the distribution of portfolio gains and losses. These approaches have the potential to increase the accuracy of the Value at Risk methodology, but at the cost of substantially more complexity.
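As a rough illustration of why the distributional assumption matters, the following sketch compares one-day parametric VaR under a normal assumption with a fat-tailed Student's t assumption. The portfolio figures and degrees of freedom are hypothetical, and this is not a reconstruction of any model LTCM actually used.

```python
from scipy.stats import norm, t

portfolio_value = 1_000_000.0
mu, sigma = 0.0005, 0.012      # assumed daily mean and std dev of returns
confidence = 0.99
df = 4                         # degrees of freedom for the fat-tailed case

var_normal = -(mu + sigma * norm.ppf(1 - confidence)) * portfolio_value

# Scale the t quantile so the distribution keeps the same standard deviation.
t_scale = sigma * ((df - 2) / df) ** 0.5
var_t = -(mu + t_scale * t.ppf(1 - confidence, df)) * portfolio_value

print(round(var_normal), round(var_t))   # the t-based VaR is larger
```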
Article / Updated 03-26-2016
In statistics, sampling distributions are the probability distributions of any given statistic based on a random sample, and they are important because they provide a major simplification on the route to statistical inference. More specifically, they allow analytical considerations to be based on the sampling distribution of a statistic rather than on the joint probability distribution of all the individual sample values.

The value of a sample statistic such as the sample mean ($\bar{X}$) is likely to be different for each sample that is drawn from a population. It can, therefore, be thought of as a random variable whose properties can be described with a probability distribution. The probability distribution of a sample statistic is known as a sampling distribution.

According to a key result in statistics known as the Central Limit Theorem, the sampling distribution of the sample mean is normal (or approximately normal) if either of two things is true:

- The underlying population is normal
- The sample size is at least 30

Two moments are needed to compute probabilities for the sample mean. The mean of the sampling distribution equals the population mean:

$$\mu_{\bar{X}} = \mu$$

The standard deviation of the sampling distribution (also known as the standard error) can take on one of two possible values:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

This is the appropriate choice for a "small" sample; for example, when the sample size is less than or equal to 5 percent of the population size. If the sample is "large" relative to the population, the standard error includes the finite population correction factor:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}}$$

Probabilities may be computed for the sample mean directly from the standard normal table by applying the following formula:

$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}}$$
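Here is a minimal sketch of the choice between the two standard error formulas, assuming only that the population standard deviation, the sample size, and the population size are known; the numbers are illustrative.

```python
from math import sqrt

def standard_error(sigma, n, N):
    """Standard error of the sample mean, with the finite population
    correction applied when the sample exceeds 5 percent of the population."""
    se = sigma / sqrt(n)
    if n / N > 0.05:
        se *= sqrt((N - n) / (N - 1))    # finite population correction factor
    return se

print(standard_error(sigma=20.0, n=100, N=10_000))    # 2.0 -- no correction
print(standard_error(sigma=20.0, n=1_000, N=10_000))  # correction applied
```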