Articles From David Semmelroth
Cheat Sheet / Updated 03-10-2022
Summary statistical measures represent the key properties of a sample or population as a single numerical value. This has the advantage of providing important information in a very compact form. It also simplifies comparing multiple samples or populations. Summary statistical measures can be divided into three types: measures of central tendency, measures of central dispersion, and measures of association.
Article / Updated 03-26-2016
The two basic types of probability distributions are known as discrete and continuous. Discrete distributions describe the properties of a random variable for which every individual outcome is assigned a positive probability. A random variable is actually a function; it assigns numerical values to the outcomes of a random process. Continuous distributions describe the properties of a random variable for which individual probabilities equal zero; positive probabilities can only be assigned to ranges of values, or intervals.

Two of the most widely used discrete distributions are the binomial and the Poisson. You use the binomial distribution when a random process consists of a sequence of independent trials, each of which has only two possible outcomes, with the probabilities of these outcomes constant on each trial. For example, you could use the binomial distribution to determine the probability that a specified number of defaults will take place in a portfolio of bonds (if you can assume that the bonds are independent of each other). You use the Poisson distribution when a random process consists of events occurring over a given interval of time. For example, you could use the Poisson distribution to determine the likelihood that three stocks in an investor's portfolio pay dividends over the coming year.

Some of the most widely used continuous probability distributions are the:

- Normal distribution
- Student's t-distribution
- Lognormal distribution
- Chi-square distribution
- F-distribution

The normal distribution is one of the most widely used distributions in many disciplines, including economics, finance, biology, physics, psychology, and sociology. The normal distribution is often illustrated as a bell-shaped curve, or bell curve, which indicates that the distribution is symmetrical about its mean. Further, it is defined for all values from negative infinity to positive infinity. Many real-world variables seem to follow the normal distribution (at least approximately), which accounts for its popularity. For example, it's often assumed that returns to financial assets are normally distributed (although this isn't entirely correct).

For situations in which the normal distribution is not appropriate, the Student's t-distribution is often used in its place. The Student's t-distribution shares several properties with the normal distribution; the most important difference is that it is more "spread out" about the mean. The Student's t-distribution is often used for analyzing the properties of small samples.

The lognormal distribution is closely related to the normal distribution, as follows:

- If Y = ln X and X is lognormally distributed, then Y is normally distributed.
- If X = e^Y and Y is normally distributed, then X is lognormally distributed.

For example, if returns to financial assets are normally distributed, then their prices are lognormally distributed. Unlike the normal distribution, the lognormal distribution is defined only for non-negative values. Instead of being symmetrical, the lognormal distribution is positively skewed.
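These distributions are straightforward to experiment with in code. Here is a minimal sketch using Python's scipy.stats; the portfolio size, default probability, dividend rate, and lognormal parameter below are hypothetical values chosen to mirror the examples above.

```python
import numpy as np
from scipy import stats

# Binomial: probability of exactly 2 defaults in a portfolio of 20
# independent bonds, each with a (hypothetical) 5% chance of default.
p_two_defaults = stats.binom.pmf(k=2, n=20, p=0.05)

# Poisson: probability that exactly 3 dividend events occur over the
# coming year when the (hypothetical) average rate is 2 events per year.
p_three_events = stats.poisson.pmf(k=3, mu=2)

print(f"P(2 defaults)  = {p_two_defaults:.4f}")
print(f"P(3 dividends) = {p_three_events:.4f}")

# Lognormal/normal link: if X is lognormal, then ln(X) is normal.
# scipy's lognorm takes s = the standard deviation of the underlying normal.
x = stats.lognorm.rvs(s=0.5, size=1000, random_state=0)
print(stats.norm.fit(np.log(x)))  # close to (0, 0.5) for this sample
```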
The chi-square distribution is characterized by degrees of freedom and is defined only for non-negative values. It is also positively skewed. You can use the chi-square distribution for several applications, including these:

- Testing hypotheses about the variance of a population
- Testing whether a population follows a specified probability distribution
- Determining whether two populations are independent of each other

The F-distribution is characterized by two different degrees of freedom: numerator and denominator. It's defined only for non-negative values and is positively skewed. You can use the F-distribution to determine whether the variances of two populations are equal. You can also use it in regression analysis to determine whether a group of slope coefficients is statistically significant.
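As an illustration, here is a minimal sketch of using the F-distribution to test whether two population variances are equal; both samples are hypothetical.

```python
import numpy as np
from scipy import stats

a = np.array([20.1, 19.8, 21.2, 20.5, 19.9, 20.8])  # hypothetical sample 1
b = np.array([18.9, 22.3, 20.0, 23.1, 17.8, 21.5])  # hypothetical sample 2

f_stat = a.var(ddof=1) / b.var(ddof=1)  # ratio of the sample variances
df1, df2 = len(a) - 1, len(b) - 1       # numerator and denominator degrees of freedom

# Two-tailed p-value from the F-distribution
p_value = 2 * min(stats.f.cdf(f_stat, df1, df2), stats.f.sf(f_stat, df1, df2))
print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
```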
Article / Updated 03-26-2016
Hypothesis testing is a statistical technique that is used in a variety of situations. Though the technical details differ from situation to situation, all hypothesis tests use the same core set of terms and concepts. The following descriptions of common terms and concepts refer to a hypothesis test in which the means of two populations are being compared.

Null hypothesis
The null hypothesis is a clear statement about the relationship between two (or more) statistical objects. These objects may be measurements, distributions, or categories. Typically, the null hypothesis, as the name implies, states that there is no relationship. In the case of two population means, the null hypothesis might state that the means of the two populations are equal.

Alternative hypothesis
Once the null hypothesis has been stated, it is easy to construct the alternative hypothesis: it is essentially the statement that the null hypothesis is false. In this example, the alternative hypothesis would be that the means of the two populations are not equal.

Significance
The significance level is a measure of the statistical strength of the hypothesis test: the probability of incorrectly concluding that the null hypothesis is false (that is, of rejecting a null hypothesis that is actually true). The significance level is something that you should specify up front. In applications, the significance level is typically one of three values: 10%, 5%, or 1%. A 1% significance level represents the strongest test of the three, because it allows only a 1% chance of rejecting a true null hypothesis; a test at the 1% level is stricter than one at the 10% level.

Power
Related to significance, the power of a test measures the probability of correctly rejecting the null hypothesis when it is false. Power is not something that you can choose. It is determined by several factors, including the significance level you select and the size of the difference between the things you are trying to compare. Unfortunately, significance and power work against each other: making the test stricter by lowering the significance level also decreases its power. This makes it difficult to design experiments that are both very strict and very powerful.

Test statistic
The test statistic is a single measure that captures the statistical nature of the relationship between the observations you are dealing with. The test statistic depends fundamentally on the number of observations being evaluated and differs from situation to situation.

Distribution of the test statistic
The whole notion of hypothesis testing rests on the ability to specify (exactly or approximately) the distribution that the test statistic follows. In this example, the difference between the sample means will be approximately normally distributed (assuming there is a relatively large number of observations).

One-tailed vs. two-tailed tests
Depending on the situation, you may want (or need) to employ a one-tailed or two-tailed test. These tails refer to the right and left tails of the distribution of the test statistic. A two-tailed test allows for the possibility that the test statistic is either very large or very small (that is, large in the negative direction); a one-tailed test allows for only one of these possibilities. When the null hypothesis states that the two population means are equal, you need to allow for the possibility that either one could be larger than the other. The test statistic could be either positive or negative, so you employ a two-tailed test. The null hypothesis might instead have been one-sided: for example, that the mean of population 1 is no larger than the mean of population 2.
In that case, you don't need to account statistically for the situation where the first mean is smaller than the second (that outcome is consistent with the null hypothesis), so you would employ a one-tailed test.

Critical value
The critical value in a hypothesis test is based on two things: the distribution of the test statistic and the significance level. The critical value(s) mark the point(s) in the distribution of the test statistic beyond which the tail area (meaning probability) exactly equals the significance level that was chosen.

Decision
Your decision to reject or not reject the null hypothesis is based on comparing the test statistic to the critical value. If the test statistic exceeds the critical value, you reject the null hypothesis; in this case, you would say that the difference between the two population means is statistically significant. Otherwise, you fail to reject the null hypothesis.

P-value
The p-value of a hypothesis test gives you another way to evaluate the null hypothesis: it is the lowest significance level at which your particular test statistic would justify rejecting the null hypothesis. For example, if you have chosen a significance level of 5% and the p-value turns out to be .03 (or 3%), you would be justified in rejecting the null hypothesis, because the p-value falls below your chosen level.
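To make these terms concrete, here is a minimal sketch of a two-tailed test of equal means using scipy's standard two-sample t-test; the two samples are hypothetical.

```python
import numpy as np
from scipy import stats

group1 = np.array([5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3])  # hypothetical sample 1
group2 = np.array([4.6, 4.9, 4.7, 5.0, 4.5, 4.8, 4.7])  # hypothetical sample 2

alpha = 0.05  # significance level, chosen up front

# H0: the population means are equal; H1: they are not (two-tailed).
t_stat, p_value = stats.ttest_ind(group1, group2)

# Reject H0 whenever the p-value is at or below the significance level.
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f}: {decision}")
```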
Article / Updated 03-26-2016
Several different types of graphs may be useful for analyzing data, including stem-and-leaf plots, scatter plots, box plots, histograms, quantile-quantile (QQ) plots, and autocorrelation plots.

- A stem-and-leaf plot consists of a "stem" that reflects the categories in a data set and a "leaf" that shows each individual value in the data set.
- A scatter plot consists of a series of points that reflect observations from two data sets; the plot shows the relationship between the two data sets.
- A box plot shows summary measures for a data set. The plot takes the form of a rectangle whose features represent measures such as the minimum value, the maximum value, the quartiles, and so on.
- A histogram shows the distribution of a data set as a series of vertical bars. Each bar represents a category (usually a numerical value or a range of numerical values) found in the data set, and the height of each bar represents the frequency of values in the category. Histograms are often used to identify the distribution a data set follows.
- A QQ (quantile-quantile) plot compares the distribution of a data set with an assumed distribution.
- An autocorrelation plot shows how closely related the elements of a time series are to their own past values.
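As a quick illustration, here is a minimal sketch that draws four of these plot types with matplotlib and scipy; the data set is a randomly generated stand-in.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=0.0, scale=1.0, size=250)  # hypothetical data set

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Scatter plot: each observation against the next one
axes[0, 0].scatter(data[:-1], data[1:], s=10)
axes[0, 0].set_title("Scatter plot")

# Box plot: quartiles, whiskers, and outliers in one rectangle
axes[0, 1].boxplot(data)
axes[0, 1].set_title("Box plot")

# Histogram: frequency of values in each bin
axes[1, 0].hist(data, bins=20)
axes[1, 0].set_title("Histogram")

# QQ plot: sample quantiles against normal quantiles
stats.probplot(data, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title("Normal QQ plot")

plt.tight_layout()
plt.show()
```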
Article / Updated 03-26-2016
One important way to draw conclusions about the properties of a population is with hypothesis testing. You can use hypothesis tests to compare a population measure to a specified value, compare measures for two populations, determine whether a population follows a specified probability distribution, and so forth. Hypothesis testing is conducted as a six-step procedure:

1. Null hypothesis
2. Alternative hypothesis
3. Level of significance
4. Test statistic
5. Critical value
6. Decision

The null hypothesis is a statement that's assumed to be true unless there's strong evidence against it. The alternative hypothesis is a statement that is accepted if the null hypothesis is rejected. The level of significance specifies the likelihood of rejecting the null hypothesis when it's true; doing so is known as a Type I error. The test statistic is a numerical measure you compute from sample data to determine whether or not the null hypothesis should be rejected. The critical value is used as a benchmark to determine whether the test statistic is too extreme to be consistent with the null hypothesis. The decision as to whether or not the null hypothesis should be rejected is made as follows: if the absolute value of the test statistic exceeds the absolute value of the critical value, the null hypothesis is rejected; otherwise, the null hypothesis fails to be rejected.
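Here is a minimal sketch of the six steps applied to a one-sample t-test of a population mean; the sample values and the hypothesized mean of 10 are hypothetical.

```python
import numpy as np
from scipy import stats

sample = np.array([9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 9.7, 10.0])

# Step 1: null hypothesis       H0: mu = 10
# Step 2: alternative           H1: mu != 10
mu0 = 10.0

# Step 3: level of significance (probability of a Type I error)
alpha = 0.05

# Step 4: test statistic  t = (x_bar - mu0) / (s / sqrt(n))
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation
t_stat = (x_bar - mu0) / (s / np.sqrt(n))

# Step 5: critical value from the Student's t-distribution (two-tailed)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)

# Step 6: decision rule -- reject H0 if |t| exceeds the critical value
if abs(t_stat) > t_crit:
    print(f"|t| = {abs(t_stat):.3f} > {t_crit:.3f}: reject H0")
else:
    print(f"|t| = {abs(t_stat):.3f} <= {t_crit:.3f}: fail to reject H0")
```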
Article / Updated 03-26-2016
Measures of association quantify the strength and the direction of the relationship between two data sets. Here are the two most commonly used measures of association:

- Covariance
- Correlation

Both measures are used to show how closely two data sets are related to each other. The main difference between them is the units in which they are measured: the correlation measure is defined to assume values between –1 and 1, which makes interpretation very easy.

Covariance
The covariance between two samples is computed as follows:

$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $n$ is the number of paired observations, and $\bar{x}$ and $\bar{y}$ are the sample means. The covariance between two populations is computed as follows:

$$\sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i - \mu_X)(y_i - \mu_Y)}{N}$$

where $N$ is the population size, and $\mu_X$ and $\mu_Y$ are the population means.

Correlation
The correlation between two samples is computed like this:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

where $s_x$ and $s_y$ are the sample standard deviations. The correlation between two populations is computed like this:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_X \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the population standard deviations.
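The sample formulas translate directly into a few lines of numpy; the two data sets below are hypothetical.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical data set 1
y = np.array([1.5, 3.9, 6.1, 7.8, 10.2])  # hypothetical data set 2

n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # sample covariance
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))             # sample correlation

# numpy's built-ins return 2x2 matrices; the off-diagonal entry matches.
print(s_xy, np.cov(x, y)[0, 1])
print(r_xy, np.corrcoef(x, y)[0, 1])
```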
Article / Updated 03-26-2016
Measures of central dispersion show how "spread out" the elements of a data set are from the mean. Three of the most commonly used measures of central dispersion are the following:

- Range
- Variance
- Standard deviation

Range
The range of a data set is the difference between the largest value and the smallest value. You compute it the same way for both samples and populations.

Variance
You can think of the variance as the average squared difference between the elements of a data set and the mean. The formulas for computing a sample variance and a population variance are slightly different. Here is the formula for computing sample variance:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

And here is the formula for computing population variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

Standard deviation
The standard deviation is simply the square root of the variance. It's more commonly used as a measure of dispersion than the variance because it's measured in the same units as the elements of the data set, whereas the variance is measured in squared units.
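In numpy these measures are one-liners; note the ddof argument, which switches between the sample formula (divide by n – 1) and the population formula (divide by N). The data set is hypothetical.

```python
import numpy as np

data = np.array([3.0, 7.0, 8.0, 5.0, 12.0, 14.0, 21.0])  # hypothetical data set

data_range = data.max() - data.min()  # range: largest minus smallest value
var_sample = data.var(ddof=1)         # sample variance: divides by n - 1
var_pop = data.var(ddof=0)            # population variance: divides by N
std_sample = data.std(ddof=1)         # standard deviation: same units as the data

print(data_range, var_sample, var_pop, std_sample)
```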
Article / Updated 03-26-2016
Measures of central tendency show the center of a data set. Three of the most commonly used measures of central tendency are the mean, median, and mode.

Mean
Mean is another word for average. Here is the formula for computing the mean of a sample:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

With this formula, you compute the sample mean by simply adding up all the elements in the sample and then dividing by the number of elements in the sample. Here is the corresponding formula for computing the mean of a population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Although the notation is slightly different, the procedure for computing a population mean is the same as the procedure for computing a sample mean. Greek letters are used to describe populations, whereas Roman letters are used to describe samples.

Median
The median of a data set is a value that divides the data into two equal halves: half of the elements of a data set are less than the median, and the remaining half are greater than the median. The procedure for computing the median is the same for both samples and populations.

Mode
The mode of a data set is the most commonly observed value in the data set. You determine the mode in the same way for a sample and a population.
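Python's standard library computes all three measures directly; the data set below is hypothetical.

```python
from statistics import mean, median, mode

data = [4, 7, 7, 9, 12, 15, 18]  # hypothetical data set

print(mean(data))    # (4 + 7 + 7 + 9 + 12 + 15 + 18) / 7 = 72 / 7, about 10.29
print(median(data))  # middle value of the sorted data: 9
print(mode(data))    # most frequently observed value: 7
```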
Article / Updated 03-26-2016
For a data set that consists of observations taken at different points in time (that is, time series data), it's important to determine whether or not the observations are correlated with each other. This is because many techniques for modeling time series data are based on the assumption that the observations are uncorrelated with each other (independent). One graphical technique you can use to check whether the observations are correlated is the autocorrelation function. The autocorrelation function shows the correlation between observations in a time series at different lags. For example, the autocorrelation at lag 1 refers to the correlation between each individual observation and its previous value. This figure shows the autocorrelation function for ExxonMobil's daily returns in 2013.

[Figure: Autocorrelation function of daily returns to ExxonMobil stock in 2013.]

Each "spike" in the autocorrelation function represents the correlation between observations with a given lag. The autocorrelation at lag 0 always equals 1, because it represents the correlation of the observations with themselves. On the graph, the dashed lines represent the lower and upper limits of a confidence interval. If a spike rises above the upper limit of the confidence interval or falls below the lower limit, the correlation at that lag is significantly different from 0, which is evidence against the independence of the elements in the data set. In this case, only one spike (at lag 8) is statistically significant. A single significant spike among many lags can easily occur by chance, so the ExxonMobil returns may still be independent; a more formal statistical test would show whether or not that is true.
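Here is a minimal sketch of computing an autocorrelation function by hand; the random returns are a stand-in for actual stock data, and the ±1.96/√n limits are the usual approximate 95% confidence bounds under independence.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
returns = rng.normal(0.0, 0.01, size=250)  # hypothetical daily returns

n = len(returns)
conf = 1.96 / np.sqrt(n)  # approximate 95% confidence limits under independence

for lag in range(1, 11):
    # Correlation between the series and itself shifted by `lag`
    r = np.corrcoef(returns[:-lag], returns[lag:])[0, 1]
    marker = "significant" if abs(r) > conf else ""
    print(f"lag {lag:2d}: acf = {r:+.3f} {marker}")
```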
Article / Updated 03-26-2016
A box plot is designed to show several key statistics for a data set in the form of a vertical rectangle, or box. The statistics it can show include the following:

- Minimum value
- Maximum value
- First quartile (Q1)
- Second quartile (Q2)
- Third quartile (Q3)
- Interquartile range (IQR)

The first quartile of a data set is a numerical measure that divides the data into two parts: the smallest 25 percent of the observations and the largest 75 percent of the observations. In other words, the first quartile is a numerical value with the following properties:

- 25 percent of the observations in the data set are smaller than the first quartile.
- 75 percent of the observations in the data set are greater than the first quartile.

Similarly, the second quartile (also known as the median) divides the data in half, so 50 percent of the elements are smaller than the median and 50 percent are larger. The third quartile is the value for which the following are true:

- 75 percent of the observations in the data set are smaller than the third quartile.
- 25 percent of the observations in the data set are greater than the third quartile.

The interquartile range (IQR) is the difference between the third quartile and the first quartile: IQR = Q3 – Q1. The interquartile range is a measure of dispersion; it shows how much spread there is between the elements in the middle 50 percent of a data set.

A box plot is drawn so that:

- The top of the box represents the third quartile (Q3) of the data.
- The bottom of the box represents the first quartile (Q1) of the data.
- The middle of the box (shown with a line) represents the second quartile (Q2).

In addition, there's a line above the box to indicate the maximum value in the data that doesn't exceed Q3 + 1.5 x IQR and a line below the box to indicate the minimum value in the data that doesn't fall below Q1 – 1.5 x IQR. Values outside of this range are outliers and are shown on the box plot as individual points.

This figure shows a box plot of the daily prices of Microsoft stock from January 1, 2013 to December 31, 2013.

[Figure: Box plot of daily prices for Microsoft stock.]

There are no outliers in this data. Therefore, the bottom line in the box plot shows that the lowest price during this period was somewhat less than $26.00, and the top line shows that the highest price was just over $38.00. The bottom of the box corresponds to the first quartile, which is $27.43; the solid line in the middle of the box corresponds to the second quartile (median), which is $31.89; and the top of the box corresponds to the third quartile, which is $33.78. The height of the box equals the interquartile range (IQR), which is $6.35.

As another example, this figure shows a box plot of the daily prices of Apple stock from January 1, 2013 to December 31, 2013.

[Figure: Box plot of daily prices for Apple stock from January 1, 2013 to December 31, 2013.]

The lowest price in 2013 for Apple stock was $53.84, and the highest price was $80.11. There are no outliers in the data, so these values are shown by the bottom line and top line, respectively. The first quartile, shown at the bottom of the box, was $60.48. The second quartile was $63.65 (shown by the solid black line), and the third quartile was $70.32, shown at the top of the box. As a result, the interquartile range (IQR) is $9.84.
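Here is a minimal sketch of computing the box-plot statistics with numpy; the price series is hypothetical, not the actual Microsoft or Apple data.

```python
import numpy as np

prices = np.array([27.1, 28.4, 29.0, 30.2, 31.9,
                   32.5, 33.1, 33.9, 35.0, 36.8])  # hypothetical daily prices

q1, q2, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1  # interquartile range

lower_fence = q1 - 1.5 * iqr  # points below this would plot as outliers
upper_fence = q3 + 1.5 * iqr  # points above this would plot as outliers

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")

# matplotlib's boxplot draws the same statistics directly:
# import matplotlib.pyplot as plt; plt.boxplot(prices); plt.show()
```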