
Statistical Analysis with R For Dummies

Overview

Understanding the world of R programming and analysis has never been easier

Most guides to R, whether books or online, focus on R functions and procedures. But now, thanks to Statistical Analysis with R For Dummies, you have access to a trusted, easy-to-follow guide that focuses on the foundational statistical concepts that R addresses—as well as step-by-step guidance that shows you exactly how to implement them using R programming.

People are becoming more aware of R every day as major institutions are adopting it as a standard. Part of its appeal is that it's a free tool that's taking the place of costly statistical software packages that sometimes take an inordinate amount of time to learn. Plus, R enables a user to carry out complex statistical analyses by simply entering a few commands, making sophisticated analyses available and understandable to a wide audience. Statistical Analysis with R For Dummies enables you to perform these analyses and to fully understand their implications and results.

  • Gets you up to speed on the #1 analytics/data science software tool
  • Demonstrates how to easily find, download, and use cutting-edge community-reviewed methods in statistics and predictive modeling
  • Shows you how R offers intel from leading researchers in data science, free of charge
  • Provides information on using RStudio to work with R

Get ready to use R to crunch and analyze your data—the fast and easy way!


Statistical Analysis with R For Dummies Cheat Sheet

R provides a wide array of functions to help you with statistical analysis—from simple statistics to complex analyses. Several statistical functions are built into R and into R packages. R statistical functions fall into several categories, including central tendency and variability, relative standing, t-tests, analysis of variance, and regression analysis.

Articles From The Book


R Articles

Standard Deviation in R

After you calculate the variance of a set of numbers, you have a value whose units are different from your original measurements. For example, if your original measurements are in inches, their variance is in square inches. This is because you square the deviations before you average them. So the variance in the five-score population in the preceding example is 6.8 square inches. It might be hard to grasp what that means. Often, it's more intuitive if the variation statistic is in the same units as the original measurements. It's easy to turn variance into that kind of statistic. All you have to do is take the square root of the variance. Like the variance, this square root is so important that it has a special name: standard deviation.
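A quick sketch of that square-root step in R, assuming the five-score population from the example is stored in a vector called heights (as it is later in this article):

```r
# The five-score population from the example (in inches)
heights <- c(50, 47, 52, 46, 45)

# Population variance: the average squared deviation from the mean
pop.var <- mean((heights - mean(heights))^2)
pop.var        # 6.8 square inches

# Taking the square root puts the statistic back in the original units
sqrt(pop.var)  # 2.607681 inches (rounds to 2.61)
```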

Population standard deviation

The standard deviation of a population is the square root of the population variance. The symbol for the population standard deviation is σ (lowercase sigma). Its formula is

σ = √[ Σ(X − μ)² / N ]

For this 5-score population of measurements (in inches): 50, 47, 52, 46, and 45, the population variance is 6.8 square inches, and the population standard deviation is 2.61 inches (rounded off).

Sample standard deviation

The standard deviation of a sample — an estimate of the standard deviation of a population — is the square root of the sample variance. Its symbol is s and its formula is

s = √[ Σ(X − x̄)² / (N − 1) ]

For this sample of measurements (in inches): 50, 47, 52, 46, and 45, the estimated population variance is 8.5 square inches, and the estimated population standard deviation is 2.92 inches (rounded off).
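You can confirm these sample figures with R's built-in var() and sd() functions, which both use N − 1 in the denominator. A quick check, assuming the five measurements are in a vector called heights:

```r
heights <- c(50, 47, 52, 46, 45)

# var() uses N - 1, so this is the estimated population variance
var(heights)        # 8.5

# sd() is exactly the square root of var()
sqrt(var(heights))  # 2.915476
sd(heights)         # 2.915476 (rounds to 2.92)
```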

Using R to compute standard deviation

As is the case with variance, using R to compute the standard deviation is easy: You use the sd() function. And like its variance counterpart, sd() calculates s, not σ:

> sd(heights)
[1] 2.915476

For σ (treating the five numbers as a self-contained population, in other words) you have to multiply the sd() result by the square root of (N-1)/N:

> sd(heights)*(sqrt((length(heights)-1)/length(heights)))
[1] 2.607681

Again, if you're going to use this one frequently, defining a function is a good idea:

sd.p = function(x){sd(x)*sqrt((length(x)-1)/length(x))}

And here's how you use this function:

> sd.p(heights)
[1] 2.607681


Testing a Variance in R

You might think that the function chisq.test() would be the best way to test a variance in R. Although base R provides this function, it's not appropriate here. Statisticians use this function to test other kinds of hypotheses. Instead, turn to a function called varTest(), which is in the EnvStats package. On the Packages tab, click Install. Then type EnvStats into the Install Packages dialog box and click Install. When EnvStats appears on the Packages tab, select its check box.

Before you use the test, you create a vector to hold the ten measurements:

FarKlempt.data2 <- c(12.43, 11.71, 14.41, 11.05, 9.53, 11.66, 9.33, 11.71, 14.35, 13.81)

And now, the test:

varTest(FarKlempt.data2, alternative="greater", conf.level = 0.95, sigma.squared = 2.25)

The first argument is the data vector. The second specifies the alternative hypothesis that the true variance is greater than the hypothesized variance, the third gives the confidence level (1 – α), and the fourth is the hypothesized variance. Running that line of code produces these results:

Results of Hypothesis Test
--------------------------
Null Hypothesis:          variance = 2.25
Alternative Hypothesis:   True variance is greater than 2.25
Test Name:                Chi-Squared Test on Variance
Estimated Parameter(s):   variance = 3.245299
Data:                     FarKlempt.data2
Test Statistic:           Chi-Squared = 12.9812
Test Statistic Parameter: df = 9
P-value:                  0.163459
95% Confidence Interval:  LCL = 1.726327
                          UCL = Inf

Among other statistics, the output shows the chi-square (12.9812) and the p-value (0.163459). (The chi-square value in the previous section is a bit lower because of rounding.) The p-value is greater than .05. Therefore, you cannot reject the null hypothesis. How high would chi-square (with df = 9) have to be in order to reject? Hmmm. . . .
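If you'd rather not install a package, the computation varTest() performs is straightforward in base R: the test statistic is (N − 1)s²/σ₀², compared against a chi-square distribution with N − 1 degrees of freedom. A sketch, which also answers that closing question:

```r
FarKlempt.data2 <- c(12.43, 11.71, 14.41, 11.05, 9.53,
                     11.66, 9.33, 11.71, 14.35, 13.81)
sigma.squared <- 2.25              # hypothesized variance
n <- length(FarKlempt.data2)

# Chi-square statistic: (N-1) times sample variance over hypothesized variance
chi.sq <- (n - 1) * var(FarKlempt.data2) / sigma.squared
chi.sq                             # 12.9812

# One-tailed p-value for the "greater" alternative
pchisq(chi.sq, df = n - 1, lower.tail = FALSE)   # 0.163459

# How high would chi-square have to be to reject at alpha = .05?
qchisq(.95, df = n - 1)            # about 16.92
```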


Plotting t in ggplot2

The grammar-of-graphics approach takes considerably more effort when plotting the values of a t-distribution than base R. But follow along and you'll learn a lot about ggplot2.

You start by putting the relevant numbers into a data frame:

t.frame = data.frame(t.values, df3 = dt(t.values,3), df10 = dt(t.values,10), std_normal = dnorm(t.values))

The first six rows of the data frame look like this:

> head(t.frame)
  t.values         df3        df10   std_normal
1     -4.0 0.009163361 0.002031034 0.0001338302
2     -3.9 0.009975671 0.002406689 0.0001986555
3     -3.8 0.010875996 0.002854394 0.0002919469
4     -3.7 0.011875430 0.003388151 0.0004247803
5     -3.6 0.012986623 0.004024623 0.0006119019
6     -3.5 0.014224019 0.004783607 0.0008726827

That's a pretty good-looking data frame, but it's in wide format. ggplot() prefers long format — which is the three columns of density-numbers stacked into a single column. To get to that format — it's called reshaping the data — make sure you have the reshape2 package installed. Select its check box on the Packages tab and you're ready to go.

Reshaping from wide format to long format is called melting the data, so the function is

t.frame.melt <- melt(t.frame, id="t.values")

The id argument specifies that t.values is the variable whose numbers don't get stacked with the rest. Think of it as the variable that identifies each row of the data. The first six rows of t.frame.melt are:

> head(t.frame.melt)
  t.values variable       value
1     -4.0      df3 0.009163361
2     -3.9      df3 0.009975671
3     -3.8      df3 0.010875996
4     -3.7      df3 0.011875430
5     -3.6      df3 0.012986623
6     -3.5      df3 0.014224019

It's always a good idea to have meaningful column names, so . . .

> colnames(t.frame.melt) = c("t","df","density")
> head(t.frame.melt)
     t  df     density
1 -4.0 df3 0.009163361
2 -3.9 df3 0.009975671
3 -3.8 df3 0.010875996
4 -3.7 df3 0.011875430
5 -3.6 df3 0.012986623
6 -3.5 df3 0.014224019

Now for one more thing before you start on the graph.
This is a vector that will be useful when you lay out the x-axis:

x.axis.values <- seq(-4,4,2)

Begin with ggplot():

ggplot(t.frame.melt, aes(x=t, y=density, group=df))

The first argument is the data frame. The aesthetic mappings tell you that t is on the x-axis, density is on the y-axis, and the data falls into groups specified by the df variable.

This is a line plot, so the appropriate geom function to add is geom_line:

geom_line(aes(linetype=df))

Geom functions can work with aesthetic mappings. The aesthetic mapping here maps df to the type of line.

Rescale the x-axis so that it goes from –4 to 4, by twos. Here's where to use that x.axis.values vector:

scale_x_continuous(breaks=x.axis.values, labels=x.axis.values)

The first argument sets the breakpoints for the x-axis, and the second provides the labels for those points. Putting these three statements together

ggplot(t.frame.melt, aes(x=t, y=density, group=df)) +
  geom_line(aes(linetype=df)) +
  scale_x_continuous(breaks = x.axis.values, labels = x.axis.values)

results in the following figure. One of the benefits of ggplot2 is that the code automatically produces a legend.

You still have some work to do. First of all, the default linetype assignments are not what you want, so you have to redo them:

scale_linetype_manual(values = c("dotted","dashed","solid"), labels = c("3","10", expression(infinity)))

The four statements

ggplot(t.frame.melt, aes(x=t, y=density, group=df)) +
  geom_line(aes(linetype=df)) +
  scale_x_continuous(breaks = x.axis.values, labels = x.axis.values) +
  scale_linetype_manual(values = c("dotted","dashed","solid"),
                        labels = c("3","10", expression(infinity)))

produce the following. As you can see, the items in the legend are not in the order that the curves appear at their centers. A graph is more comprehensible when the graph elements and the legend elements are in sync. ggplot2 provides guide functions that enable you to control the legend's details.

To reverse the order of the linetypes in the legend, here's what you do:

guides(linetype=guide_legend(reverse = TRUE))

Putting all the code together, finally, yields the following figure.

ggplot(t.frame.melt, aes(x=t, y=density, group=df)) +
  geom_line(aes(linetype=df)) +
  scale_x_continuous(breaks = x.axis.values, labels = x.axis.values) +
  scale_linetype_manual(values = c("dotted","dashed","solid"),
                        labels = c("3","10", expression(infinity))) +
  guides(linetype=guide_legend(reverse = TRUE))

Base R graphics versus ggplot2: It's like driving a car with a standard transmission versus driving with an automatic transmission!
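One last note on the reshaping step near the top of this example: melt() is doing nothing mysterious. It stacks the density columns into one column and records which column each value came from. A base-R sketch of the same reshaping, under the assumption (consistent with the head() output shown above) that t.values is seq(-4, 4, .1):

```r
# Assumed from earlier in the chapter: t.values runs from -4 to 4 by .1
t.values <- seq(-4, 4, .1)
t.frame <- data.frame(t.values,
                      df3        = dt(t.values, 3),
                      df10       = dt(t.values, 10),
                      std_normal = dnorm(t.values))

# Base-R equivalent of melt(t.frame, id = "t.values"):
# stack the three density columns and label each value's origin
dens.cols <- c("df3", "df10", "std_normal")
t.frame.long <- data.frame(
  t       = rep(t.frame$t.values, times = length(dens.cols)),
  df      = rep(dens.cols, each = nrow(t.frame)),
  density = unlist(t.frame[dens.cols], use.names = FALSE)
)
head(t.frame.long)  # same rows as head(t.frame.melt) after the renaming
```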