Machine Learning: Using Spark to Deal with Massive Data

TensorFlow For Dummies

The real world of machine learning relies heavily on huge datasets. Imagine trying to wend your way through the enormous volume of data generated by just one day of sales at Amazon.com. You need products that help you manage these huge datasets in a manner that makes them easier to work with and faster to process. This is where Spark comes in. It relies on cluster computing: the data, and the work of processing it, are spread across a group of connected machines.
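To make the idea of spreading work across a cluster a little more concrete, here is a minimal sketch in plain Python. It is not Spark code and doesn't use Spark's API. It only imitates the general pattern of splitting a dataset into partitions, processing each partition independently, and merging the partial results. The function names, the sample sales records, and the use of threads as stand-ins for cluster machines are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_sales(partition):
    """Sum the units sold per product within a single partition."""
    counts = Counter()
    for product, units in partition:
        counts[product] += units
    return counts


def total_sales(records, num_partitions=4):
    """Process each partition independently, then merge the partial sums."""
    partitions = split_into_partitions(records, num_partitions)
    # Threads stand in for cluster machines here (an illustrative assumption).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_sales, partitions))
    totals = Counter()
    for partial in partial_counts:
        totals.update(partial)  # Counter.update adds counts together
    return dict(totals)


# Hypothetical sales records: (product, units sold)
sales = [("book", 2), ("lamp", 1), ("book", 3), ("pen", 10), ("lamp", 4)]
print(total_sales(sales))
```

Because each partition is processed without reference to the others, adding more workers lets the same code handle more data. That is the property a cluster-based product relies on, although a real cluster also has to move data between machines and recover from failures, which this sketch ignores.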

The emphasis of Spark is speed. When you visit the site, you’re greeted by statistics, such as Spark’s capability to process data up to a hundred times faster than other products, such as Hadoop MapReduce, when the data is held in memory. However, Spark also offers flexibility in that it works with Java, Scala, Python, and R, and it runs on any platform that has a supported Java runtime. You can even run Spark in the cloud if you want.

Spark works with huge datasets, which means that you need to know programming languages, database management, and other developer techniques to use it. As a result, the Spark learning curve can be quite steep, and you need to allow time for developers on your team to learn it. The simple examples at Spark’s website give you some idea of just what is involved. Notice that all the examples include some level of coding, so you really do need to have programming skills to use this option.

About This Article

About the book authors:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist who specializes in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Through his job as a quantitative marketing consultant and marketing researcher, he has worked with quantitative data since 2000 for different clients in various industries, and he is one of the top 10 Kaggle data scientists.