The Art of Data Science in Spark
Apache Spark, or simply “Spark,” is a highly distributed, fault-tolerant, scalable framework that processes massive amounts of data. As it processes data, Spark abstracts the distribution of the data computations via a machine cluster thus enabling you to create applications using Java, Scala, Python, R, and SQL.
Spark has multiple APIs that make life easier for data scientists and analysts when processing and cleaning data:
- The Spark platform allows data analysts to use SQL code to process data.
- MLlib is a scalable machine learning API with several algorithms ready to use to effectively train various models.
- Spark Streaming is an API that makes the build-out of scalable, fault-tolerant streaming applications more streamlined and easier for data scientists. This API library enables you to process data in mini-batches while being able to tune various aspects of the process at each step.
- NOTE on the Sparking Streaming API: recently, the Structured Streaming API was launched. This new API runs on top of the Spark SQL API by making use of the dataframes.
- GraphX is an API that allows for seamless graph analysis and graph-parallel computations.
Databricks — the company behind Apache Spark — was funded by the creators of the Spark project and have many supporters contributing to their open source endeavors.
As part of a full-day workshop at ODSC East 2018 on the use of data science in Spark, Adam Breindel explained how to start a cluster using Databricks Community Edition and exploring data using Spark SQL. Afterward, he covered the basics and more advanced topics of machine learning using the Spark MLlib library.
To get started using Spark, a team of scientists from Databricks enabled a sandbox environment for you to learn and test your Spark projects and code. You just need to create a community edition account at Databricks.
Once you’re set up with an account, there are feature notebooks that you can explore. If you’re a beginner then you can go through everything in order. If you’re already a bit familiar with Spark then it’s recommended you start with Databricks for Data Scientists.
In order to explore the notebooks or test your own code, you will need to create a cluster. In just a few clicks you can create a cluster with 6GB of memory. Note that the cluster will automatically terminate after 2 hours of idling.
You can import your own notebook file to the Databricks CE in any of these formats: .dbc, .scala, .py, .sql, .r, .ipynb, .html. This is a helpful way to check your code and test it with a sample dataset, or even to run small projects.
The Databricks for Data Scientists notebook also contains a link to a documentation notebook for data engineers that is worth looking into if you want to learn more about data engineering in Spark.
Data Science and Spark
In general, for any data science project, you will start with some exploratory data analysis (EDA). The EDA phase is about exploring the datasets in terms of the problem you are working on. EDA will include univariate and multivariate analysis. If you are working with structured data then SQL is a great option. EDA is usually performed using SQL to query the data and get a basic understanding of the datasets.
Spark provides the Spark SQL library where you can query your data using SQL syntax. This SQL code is transformed to the right scala commands to process your data. However, this abstraction facilitates the analyst job of exploring the data using SQL code, which most analysts are already familiar with performing.
You can upload data to your cluster and use Spark SQL to explore it.
Next, you can use SQL code just by adding %sql at the beginning of the cell in the notebook. This will allow you to write SQL syntax to process your data in Spark.
The Spark notebook provided by Databricks has an easy way to visualize your results from the queries. Once you get your analyses back, you can modify the options to visualize the results — an especially exciting feature designed for exploring categorical data or time series data.
Once you have covered the data exploration part and maybe created some new features in your dataset, you are ready to start training a machine learning models.
Apache Spark MLlib is the library in Spark for machine learning tasks. You have some machine learning algorithms for regression, classification, and clustering problems. You can apply feature engineering for feature extraction, dimensionality reduction, and selection, you can create pipelines and also select the best models with the hyperparameter tuning, and you can also save your trained machine learning models to deploy them later. Scala and Python are also good candidates for use with MLlib if you’re more familiar with those.
Check out the documentation of MLlib in the Apache Spark project website.
If you prefer an electronic version of what I just told you, you can download The Apache Spark Collection from Databricks. It contains all the relevant information for data engineers scientists along with a gentle introduction to Spark.
Practicing data science in Spark is a go. You can start performing EDA using the Spark SQL library, and it’s possible to create derived variables or new features to then train your machine learning models using the Apache Spark MLlib.
Python, R, Java, and Scala are all great options for your machine learning tasks. Pipelines and model tuning can be computed for the models using Spark MLlib. Lastly, all of your data exploration projects can be made into useful visualizations of the notebooks, making it a viable option for reporting.