Using Python for data science
Python is an easy-to-learn, human-readable programming language that you can use for advanced data munging, analysis, and visualization. It's easy to install and set up, and it's generally easier to learn than the R programming language. Python runs on Mac, Windows, and UNIX. IPython offers a very user-friendly coding interface for people who don't like coding from the command line. If you download and install the Anaconda Python distribution, you get your IPython/Jupyter environment, as well as the NumPy, SciPy, Matplotlib, pandas, and scikit-learn libraries (among others) that you'll likely need in your data sense-making procedures.
The base NumPy package is the basic facilitator for scientific computing in Python. It provides containers/array structures that you can use to do computations with both vectors and matrices (like in R).
SciPy and pandas are among the most commonly used Python libraries for scientific and technical computing. SciPy supplies a large collection of mathematical algorithms, including linear algebra, matrix math, sparse matrix functionalities, and statistics, while pandas is the usual choice for data munging.
Matplotlib is Python's premier data visualization library.
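To make the idea of vector and matrix computations with NumPy arrays a little more concrete, here is a minimal sketch. The variable names and the numbers are made up for illustration; only the NumPy calls themselves (np.array, the @ operator, and np.linalg.solve) are part of the library.

```python
import numpy as np

# Two vectors and two small matrices (illustrative values only).
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
B = np.array([[1.0, 4.0],
              [0.0, 1.0]])

# Arithmetic on arrays is applied element by element.
elementwise_sum = v + w

# The @ operator gives the dot product of two vectors
# and the matrix product of two matrices.
dot_product = v @ w
matrix_product = A @ B

# Solve the linear system A x = b for x.
b = np.array([2.0, 7.0])
x = np.linalg.solve(A, b)

print(elementwise_sum)
print(dot_product)
print(matrix_product)
print(x)
```

Because the arithmetic runs over whole arrays at once, you rarely need to write explicit loops, which is much the same way you would work with vectors and matrices in R.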
Lastly, the scikit-learn library is useful for machine learning, data pre-processing, and model evaluation.
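The three jobs mentioned above (pre-processing, machine learning, and model evaluation) can be sketched in a few lines. This is only one possible setup, not a recommended recipe: the choice of the bundled iris dataset, a standard-scaling step, a logistic regression model, and a 25 percent test split are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pre-processing (scaling) and the model are chained in one pipeline,
# so the scaler is fitted on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

# Model evaluation: share of held-out rows classified correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Chaining the steps in a pipeline keeps the held-out rows from influencing the scaling, which is an easy mistake to make when you pre-process the whole dataset up front.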
Using R for data science
R is another popular programming language that's used for statistical and scientific computing. Writing analysis and visualization routines in R is known as R scripting. R has been specifically developed for statistical computing, and consequently, it has a more plentiful offering of open-source statistical computing packages than Python does. Also, R's data visualization capabilities are somewhat more sophisticated than Python's, and generally easier to generate. That being said, as a language, Python is a fair bit easier for beginners to learn.
R has a very large and extremely active user community. Developers are coming up with (and sharing) new packages all the time. To mention just a few: the forecast package, the ggplot2 package, and the statnet and igraph packages.
If you want to do predictive analysis and forecasting in R, the forecast package is a good place to start. This package offers the ARMA, AR, and exponential smoothing methods.
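If you're curious what exponential smoothing actually computes, the simplest variant fits in a few lines. The sketch below is written in plain Python and is not the forecast package's code or API; the function name, the sample series, and the smoothing weight alpha = 0.5 are all hypothetical choices made for illustration.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    Each new level is a weighted average of the latest observation
    (weight alpha) and the previous level (weight 1 - alpha). The final
    level serves as the one-step-ahead forecast.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not series:
        raise ValueError("series must contain at least one value")

    levels = [series[0]]  # start from the first observation
    for value in series[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels


# Hypothetical sample data.
demand = [10.0, 12.0, 11.0, 13.0]
levels = simple_exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]

print(levels)    # [10.0, 11.0, 11.0, 12.0]
print(forecast)  # 12.0
```

A larger alpha makes the forecast react faster to recent values, while a smaller alpha smooths out more of the noise. A dedicated forecasting package would typically estimate that weight from the data instead of having you pick it by hand.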
For data visualization, you can use the ggplot2 package, which has all the standard data graphic types, plus a lot more.
Lastly, R's network analysis packages are pretty special as well. For example, you can use igraph and statnet for social network analysis, genetic mapping, traffic planning, and even hydraulic modeling.