Articles From Joseph Schmuller
Cheat Sheet / Updated 04-12-2024
A wide range of tools is available to help businesses big and small take advantage of the data science revolution. Among the most essential of these tools are Microsoft Power BI, Tableau, SQL, and the R and Python programming languages.
Article / Updated 06-07-2023
Excel can help you make all sorts of calculations. Here's a selection of Excel's statistical worksheet functions. Each one returns a value into a selected cell.
Check out these functions for central tendency and variability.
Function          What it calculates
AVERAGE           Mean of a set of numbers
AVERAGEIF         Mean of a set of numbers that meet a condition
AVERAGEIFS        Mean of a set of numbers that meet one or more conditions
HARMEAN           Harmonic mean of a set of positive numbers
GEOMEAN           Geometric mean of a set of positive numbers
MODE.SNGL         Mode of a set of numbers
MEDIAN            Median of a set of numbers
VAR.P             Variance of a set of numbers considered to be a population
VAR.S             Variance of a set of numbers considered to be a sample
STDEV.P           Standard deviation of a set of numbers considered to be a population
STDEV.S           Standard deviation of a set of numbers considered to be a sample
STANDARDIZE       A standard score based on a given mean and standard deviation
These handy functions for relative standing can also be very useful.
Function          What it calculates
RANK.EQ           Rank of a number in a set of numbers. If more than one number has the same rank, it returns the top rank of those numbers.
RANK.AVG          Rank of a number in a set of numbers. If more than one number has the same rank, it returns their average.
PERCENTRANK.INC   Rank of a number in a set of numbers, expressed as a percent of the numbers it's greater than or equal to.
PERCENTRANK.EXC   Rank of a number in a set of numbers, expressed as a percent of the numbers it's greater than.
PERCENTILE.INC    The indicated percentile in a set of numbers, in terms of "greater than or equal to."
PERCENTILE.EXC    The indicated percentile in a set of numbers, in terms of "greater than."
QUARTILE.INC      The 1st, 2nd, 3rd, or 4th quartile of a set of numbers, in terms of "greater than or equal to."
QUARTILE.EXC      The 1st, 2nd, 3rd, or 4th quartile of a set of numbers, in terms of "greater than."
These functions for correlation and regression are also good ones to know.
Function          What it calculates
CORREL            Correlation coefficient between two sets of numbers
PEARSON           Same as CORREL. (Go figure!)
RSQ               Coefficient of determination between two sets of numbers (square of the correlation coefficient)
SLOPE             Slope of a regression line through two sets of numbers
INTERCEPT         Intercept of a regression line through two sets of numbers
STEYX             Standard error of estimate for a regression line through two sets of numbers
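For readers who move between Excel and R, here is a minimal sketch of what the correlation and regression functions compute, written in base R. The data values are made up for illustration and don't come from the article, and the pairing of each R expression with an Excel function is offered as a rough counterpart rather than an official mapping.

```r
# Made-up illustrative data (an assumption, not from the article)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Correlation coefficient (what CORREL and PEARSON report)
r <- cor(x, y)

# Simple linear regression of y on x
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])                 # SLOPE
intercept <- unname(coef(fit)["(Intercept)"])   # INTERCEPT

# Coefficient of determination (RSQ): the square of the correlation
rsq <- summary(fit)$r.squared

# Standard error of estimate (STEYX): residual spread with n - 2 in the denominator
steyx <- sqrt(sum(residuals(fit)^2) / (length(x) - 2))
```

Comparing rsq with r^2 confirms the "square of the correlation coefficient" description given for RSQ.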
Cheat Sheet / Updated 01-11-2023
R provides a wide array of functions to help you with your work — from simple statistics to complex analyses. This Cheat Sheet is a handy reference for Base R statistical functions, interactive applications, machine learning, databases, and images.
Cheat Sheet / Updated 05-02-2022
To complete any project using R, you work with functions that live in packages designed for specific areas. This cheat sheet provides some information about these functions.
Cheat Sheet / Updated 01-26-2022
R provides a wide array of functions to help you with statistical analysis, from simple statistics to complex analyses. Several statistical functions are built into R and R packages. R statistical functions fall into several categories, including central tendency and variability, relative standing, t-tests, analysis of variance, and regression analysis.
Cheat Sheet / Updated 11-12-2021
Excel offers a wide range of statistical functions you can use to calculate a single value or an array of values in your Excel worksheets. The Excel Analysis ToolPak is an add-in that provides even more statistical analysis tools.
Article / Updated 09-30-2019
After you calculate the variance of a set of numbers, you have a value whose units are different from your original measurements. For example, if your original measurements are in inches, their variance is in square inches. This is because you square the deviations before you average them. So the variance in the five-score population in the preceding example is 6.8 square inches.
It might be hard to grasp what that means. Often, it's more intuitive if the variation statistic is in the same units as the original measurements. It's easy to turn variance into that kind of statistic. All you have to do is take the square root of the variance. Like the variance, this square root is so important that it has a special name: standard deviation.
Population standard deviation
The standard deviation of a population is the square root of the population variance. The symbol for the population standard deviation is σ (sigma). Its formula is
$$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$$
For this 5-score population of measurements (in inches):
50, 47, 52, 46, and 45
the population variance is 6.8 square inches, and the population standard deviation is 2.61 inches (rounded off).
Sample standard deviation
The standard deviation of a sample, an estimate of the standard deviation of a population, is the square root of the sample variance. Its symbol is s and its formula is
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$
For this sample of measurements (in inches):
50, 47, 52, 46, and 45
the estimated population variance is 8.5 square inches, and the estimated population standard deviation is 2.92 inches (rounded off).
Using R to compute standard deviation
As is the case with variance, using R to compute the standard deviation is easy: You use the sd() function. And like its variance counterpart, sd() calculates s, not σ:
> sd(heights)
[1] 2.915476
For σ (treating the five numbers as a self-contained population, in other words) you have to multiply the sd() result by the square root of (N-1)/N:
> sd(heights)*(sqrt((length(heights)-1)/length(heights)))
[1] 2.607681
Again, if you're going to use this one frequently, defining a function is a good idea:
sd.p=function(x){sd(x)*sqrt((length(x)-1)/length(x))}
And here's how you use this function:
> sd.p(heights)
[1] 2.607681
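As a cross-check, the same numbers can be worked out from first principles. The sketch below assumes that heights holds the five scores listed above (the vector is defined outside this excerpt), and the names ss, pop_var, and samp_var are illustrative.

```r
# The five scores from the article, in inches (assumed to be what `heights` holds)
heights <- c(50, 47, 52, 46, 45)

n <- length(heights)

# Sum of squared deviations from the mean: (2, -1, 4, -2, -3) squared and summed
ss <- sum((heights - mean(heights))^2)

pop_var <- ss / n          # population variance: divide by N
samp_var <- ss / (n - 1)   # sample variance: divide by N - 1

pop_sd <- sqrt(pop_var)    # population standard deviation
samp_sd <- sqrt(samp_var)  # sample standard deviation, which is what sd() returns
```

The sum of squared deviations is 34, so dividing by 5 gives 6.8 and dividing by 4 gives 8.5. The square roots match the sd.p() and sd() results shown above.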
Article / Updated 04-11-2018
If you’ve been working with images, animated images, and combined stationary images in R, it may be time to take the next step. This project walks you through the next step: Combine an image with an animated image.
This image shows the end product: the plot of the iris data set with comedy icons Laurel and Hardy positioned in front of the plot legend. When you open this combined image in the Viewer, you see Stan and Ollie dancing their little derbies off. (The derbies don’t actually come off in the animation, but you get the drift.)
Getting Stan and Ollie
Check out the Laurel and Hardy GIF. Right-click the image and select Save Image As from the pop-up menu that appears. Save it as animated-dancing-image-0243 in your Documents folder. Then read it into R:
l_and_h <- image_read("animated-dancing-image-0243.gif")
Applying the length() function to l_and_h
> length(l_and_h)
[1] 10
indicates that this GIF consists of ten frames.
To add a coolness factor, make the background of the GIF transparent before image_read() works with it. This free online image editor does the job quite nicely.
Combining the boys with the background
If you use the image combination technique, the code looks like this:
image_composite(image=background, composite_image=l_and_h, offset = "+510+200")
The picture it produces looks like the image above but with one problem: The boys aren’t dancing. Why is that?
The reason is that image_composite() combined the background with just the first frame of l_and_h, not with all ten. It’s exactly the same as if you had run
image_composite(image=background, composite_image=l_and_h[1], offset = "+510+200")
The length() function verifies this:
> length(image_composite(image=background, composite_image=l_and_h, offset = "+510+200"))
[1] 1
If all ten frames were involved, the length() function would have returned 10. To get this done properly, you have to use a magick function called image_apply().
Explaining image_apply()
To make clear how this important function works, consider an analogous function called lapply(). If you want to apply a function (like mean()) to the variables of a data frame, like iris, one way to do that is with a for loop: Start with the first column and calculate its mean, go to the next column and calculate its mean, and so on until you calculate all the column means.
It’s more concise, and more idiomatic in R, to use lapply() to apply mean() to all the variables:
> lapply(iris, mean)
$Sepal.Length
[1] 5.843333
$Sepal.Width
[1] 3.057333
$Petal.Length
[1] 3.758
$Petal.Width
[1] 1.199333
$Species
[1] NA
A warning message comes with that last one, because Species is not numeric, but that’s okay.
Another way to write lapply(iris, mean) is lapply(iris, function(x){mean(x)}). This second way comes in handy when the function becomes more complicated. If, for some reason, you want to square the value of each score in the data set, multiply the result by three, and then calculate the mean of each column, here’s how to code it:
lapply(iris, function(x){mean(3*(x^2))})
In a similar way, image_apply() applies a function to every frame in an animated GIF.
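To make the comparison between the for loop and lapply() concrete, here is a minimal sketch that computes the column means both ways. It restricts iris to its numeric columns so that no warning is raised for Species. The names numeric_iris, loop_means, and apply_means are illustrative.

```r
# Keep only the numeric columns of the built-in iris data frame
numeric_iris <- iris[sapply(iris, is.numeric)]

# for-loop version: visit each column in turn and store its mean
loop_means <- list()
for (name in names(numeric_iris)) {
  loop_means[[name]] <- mean(numeric_iris[[name]])
}

# lapply() version: one call applies mean() to every column
apply_means <- lapply(numeric_iris, mean)

# Anonymous-function form, for when the per-column work is more involved
scaled_means <- lapply(numeric_iris, function(x) mean(3 * (x^2)))
```

Both approaches produce a named list with one mean per column. The lapply() call simply removes the bookkeeping.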
In this project, the function that gets applied to every frame is image_composite():
function(frame){image_composite(image=background, composite_image=frame, offset = "+510+200")}
So, within image_apply(), that’s
frames <- image_apply(image=l_and_h, function(frame) {
  image_composite(image=background, composite_image=frame, offset = "+510+200")
})
After you run that code, length(frames) verifies the ten frames:
> length(frames)
[1] 10
Getting back to the animation
The image_animate() function puts it all in motion at ten frames per second:
animation <- image_animate(frames, fps = 10)
To put the show on the screen, it’s
print(animation)
All together now:
l_and_h <- image_read("animated-dancing-image-0243.gif")
background <- image_background(iris_plot, "white")
frames <- image_apply(image=l_and_h, function(frame) {
  image_composite(image=background, composite_image=frame, offset = "+510+200")
})
animation <- image_animate(frames, fps = 10)
print(animation)
And that’s the code for the image above.
One more thing. The image_write() function saves the animation as a handy little reusable GIF:
image_write(animation, "LHirises.gif")
Article / Updated 04-11-2018
Here, you learn about books and websites that help you learn more about R programming. Without further ado. . .
Interacting with users
If you want to delve deeper into R applications that interact with users, start with this tutorial by shiny guiding force Garrett Grolemund.
For a helpful book on the subject, consider Chris Beeley’s Web Application Development with R Using Shiny, 2nd Edition (Packt Publishing, 2016).
Machine learning
For the lowdown on all things Rattle, go directly to the source: Rattle creator Graham Williams has written Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery (Springer, 2011). Check out the companion website.
The University of California-Irvine Machine Learning Repository plays a huge role in the R programming world. Here’s how its creator prefers that you look for the material:
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Thank you, UCI Anteaters!
If machine learning interests you, take a comprehensive look at the field (under its other name, “statistical learning”): Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani’s An Introduction to Statistical Learning with Applications in R (Springer, 2017).
An Introduction to Neural Networks, by Ben Krose and Patrick van der Smagt, is a little dated, but you can get it for the low, low price of nothing.
After you download a large PDF, it’s a good idea to upload it into an ebook app, like Google Play Books. That turns the PDF into an ebook and makes it easier to navigate on a tablet.
Databases
The R-bloggers website has a nice article on working with databases. Of course, R-bloggers has terrific articles on a lot of R-related topics!
You can learn quite a bit about RFM (Recency, Frequency, Money) analysis and customer segmentation at www.putler.com/rfm-analysis.
Maps and images
The area of maps is a fascinating one. You might be interested in something at a higher level. If so, read Introduction to visualising spatial data in R by Robin Lovelace, James Cheshire, Rachel Oldroyd (and others).
David Kahle and Hadley Wickham’s ggmap: Spatial Visualization with ggplot2 is also at a higher level.
Fascinated by magick? The best place to go is the primary source. Check it out.
Article / Updated 04-11-2018
Try out this R project to see how one variable might affect an outcome. It’s conceivable that weather conditions could influence flight delays. How do you incorporate weather information into the assessment of delay?
One nycflights13 data frame called weather provides the weather data for every day and hour at each of the three origin airports. Here’s a glimpse of exactly what it has:
> glimpse(weather,60)
Observations: 26,130
Variables: 15
$ origin "EWR", "EWR", "EWR", "EWR", "EWR", "...
$ year 2013, 2013, 2013, 2013, 2013, 2013, ...
$ month 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ day 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ hour 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 1...
$ temp 37.04, 37.04, 37.94, 37.94, 37.94, 3...
$ dewp 21.92, 21.92, 21.92, 23.00, 24.08, 2...
$ humid 53.97, 53.97, 52.09, 54.51, 57.04, 5...
$ wind_dir 230, 230, 230, 230, 240, 270, 250, 2...
$ wind_speed 10.35702, 13.80936, 12.65858, 13.809...
$ wind_gust 11.918651, 15.891535, 14.567241, 15....
$ precip 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ pressure 1013.9, 1013.0, 1012.6, 1012.7, 1012...
$ visib 10, 10, 10, 10, 10, 10, 10, 10, 10, ...
$ time_hour 2012-12-31 19:00:00, 2012-12-31 20:...
So the variables it has in common with flites_name_day are the first five and the last one. To join the two data frames, use this code:
flites_day_weather <- flites_day %>%
  inner_join(weather, by = c("origin","year","month","day","hour","time_hour"))
Now you can use flites_day_weather to start answering questions about departure delay and the weather.
What questions will you ask? How will you answer them? What plots will you draw? What regression lines will you create? Will scale() help?
And, when you’re all done, take a look at arrival delay (arr_delay).
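To see what an inner join on several key columns does, here is a minimal sketch on two tiny made-up data frames. It uses base R's merge() rather than dplyr's inner_join() so that it runs without extra packages. The data frames flights_small and weather_small and their values are hypothetical stand-ins, not the nycflights13 data.

```r
# Hypothetical miniature stand-in for the flights data
flights_small <- data.frame(
  origin = c("EWR", "EWR", "JFK"),
  hour = c(5, 6, 5),
  dep_delay = c(2, 10, -1)
)

# Hypothetical miniature stand-in for the weather data
weather_small <- data.frame(
  origin = c("EWR", "JFK", "LGA"),
  hour = c(5, 5, 5),
  temp = c(37.04, 36.5, 38.1)
)

# merge() with the default all = FALSE keeps only rows whose key values
# appear in both data frames, which is the behavior of an inner join
joined <- merge(flights_small, weather_small, by = c("origin", "hour"))
```

Only the rows for EWR at hour 5 and JFK at hour 5 survive, because those are the only key combinations present in both data frames. The EWR hour-6 flight has no matching weather row, and the LGA weather row has no matching flight.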