In the last post of this series (1), we discussed the meaning of terms like Big Data, MapReduce, and Hadoop, and I presented the origin of Spark and its dominance in the Big Data world. The main advantages of Spark over Hadoop are that it is much faster and that it is much easier to learn and to develop applications with. Now, in this post, I will talk about Spark in more detail and present its main components.
In general terms, Apache Spark is a unified engine designed for large-scale distributed data processing, running either on-premises or in the cloud. One of the main reasons Spark is much faster than Hadoop MapReduce is that it keeps intermediate computations in memory instead of relying on disk. Spark is built around four key characteristics: Speed, Ease of Use, Modularity, and Extensibility.
Spark was designed with speed in mind in several ways. First, Spark's internal implementation benefits from today's commodity servers, taking advantage of multiple machines with efficient multi-threading and parallel processing. Second, Spark builds its computation as a directed acyclic graph (DAG): the query optimizer constructs an efficient computation graph that can be decomposed into tasks executed in parallel across the workers in the cluster. Finally, its physical execution engine (Tungsten) uses whole-stage code generation to produce compact code for execution.
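To see what this looks like in practice, here is a minimal sketch in Scala (the column names and numbers are just placeholders) that describes a small DataFrame computation and asks Spark to print the plans it generated; in the physical plan you can spot the operators that Tungsten compiles via whole-stage code generation.

```scala
import org.apache.spark.sql.SparkSession

object ExplainPlanExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-plan-example")
      .master("local[*]") // run locally on all cores; adjust for a real cluster
      .getOrCreate()
    import spark.implicits._

    // Transformations only describe the DAG; nothing is executed yet
    val df = spark.range(1, 1000000)
      .withColumn("doubled", $"id" * 2)
      .filter($"doubled" % 3 === 0)

    // Print the logical and physical plans that the optimizer produced;
    // operators grouped under whole-stage code generation are marked in the output
    df.explain(true)

    spark.stop()
  }
}
```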
To achieve ease of use, Spark provides a fundamental data abstraction called the Resilient Distributed Dataset (RDD), upon which all other higher-level structures (like DataFrames and Datasets) are built. This also lets the user build most applications as a combination of pre-defined transformations and actions, avoiding the need to implement things from scratch.
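As a small illustration of this idea (the numbers below are arbitrary), the sketch chains lazy transformations on an RDD and only triggers the actual distributed execution when an action is called:

```scala
import org.apache.spark.sql.SparkSession

object TransformationsAndActions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformations-and-actions")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Create an RDD from a local collection
    val numbers = sc.parallelize(1 to 100)

    // Transformations (filter, map) are lazy: they only describe the computation
    val evensScaled = numbers
      .filter(_ % 2 == 0)
      .map(_ * 10)

    // Actions (count, take) trigger the actual execution across the cluster
    println(s"Count: ${evensScaled.count()}")
    println(s"First five: ${evensScaled.take(5).mkString(", ")}")

    spark.stop()
  }
}
```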
In relation to modularity, Spark code can be written in very similar ways in five different programming languages: Scala, Java, Python, SQL, and R. Each of these languages interfaces with the same APIs that form the core components of Spark: Spark SQL, Spark Structured Streaming, Spark MLlib, and GraphX. A single Spark application can use any combination of those four components, without needing a different application for each use case.
Finally, regarding extensibility, one of the differences between Apache Spark and Apache Hadoop is that Spark decouples storage from computing, focusing only on the parallel computation engine. That way, you can use Spark to read data from several different data sources, such as Apache Hadoop, Apache HBase, MongoDB, Apache Hive, RDBMSs, Apache Kafka, Amazon Kinesis, Amazon S3, etc. There is also a long list of third-party projects that supplement Apache Spark and add to its ecosystem, linked from the Apache Spark page.
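As a hedged sketch of what this decoupling looks like in code (the S3 bucket, JDBC connection details, and join key below are all placeholders), the same DataFrame API can read from object storage and from a relational database in one application:

```scala
import org.apache.spark.sql.SparkSession

object MultipleSourcesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multiple-sources")
      .getOrCreate()

    // Parquet files on S3 (assumes the S3 connector is on the classpath)
    val events = spark.read.parquet("s3a://my-bucket/events/")

    // A table in a relational database over JDBC (connection details are made up)
    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/shop")
      .option("dbtable", "public.customers")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // The same DataFrame operations work regardless of where the data came from
    events.join(customers, "customer_id").show()

    spark.stop()
  }
}
```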
As shown in the figure below, Spark is composed of four components that are separate from Spark's core fault-tolerant engine: you use their APIs to build your application, and Spark converts it into a DAG that is executed by the core engine.
Spark SQL is the component designed to work with structured data. With this API, you can read data stored in database tables or in file formats with structured data (such as CSV, text, JSON, Avro, ORC, Parquet, etc.) and then turn it into tables in Spark (similar, for example, to what we can do in Python with Pandas). After transforming the data into tables, we can use the Spark SQL API to process it with a series of SQL-like transformations (such as filter, join, group by, etc.), either by writing SQL directly or by using the built-in DataFrame operations.
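Here is a short sketch of both styles (the CSV path and column names are made up for illustration): the same aggregation is expressed once as a SQL query over a temporary view and once with the built-in DataFrame operations.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-example")
      .getOrCreate()

    // Read a structured file into a DataFrame
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sales.csv")

    // Option 1: register the DataFrame as a table and use SQL directly
    sales.createOrReplaceTempView("sales")
    spark.sql(
      """SELECT country, SUM(amount) AS total
        |FROM sales
        |WHERE amount > 0
        |GROUP BY country""".stripMargin).show()

    // Option 2: the equivalent built-in DataFrame operations
    sales.filter(col("amount") > 0)
      .groupBy("country")
      .agg(sum("amount").alias("total"))
      .show()

    spark.stop()
  }
}
```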
Spark MLlib is the Spark implementation of common machine learning algorithms, all written to exploit parallelization as much as possible. It contains algorithms for common use cases (such as linear and logistic regression, decision trees, random forests and gradient-boosted trees, recommendation systems, clustering, topic modeling, etc.) as well as many other functions used in ML workflows, such as feature transformations, model evaluation, and hyperparameter search.
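Below is a minimal sketch of an MLlib workflow (the dataset, feature columns, and label column are hypothetical) that assembles raw columns into a feature vector and fits a distributed logistic regression through a Pipeline:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MLlibExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("mllib-example")
      .getOrCreate()

    // A labeled dataset with numeric features (file and column names are placeholders)
    val data = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/churn.csv")

    // Combine the raw feature columns into the single vector column MLlib expects
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "tenure", "monthly_charges"))
      .setOutputCol("features")

    // A logistic regression estimator trained in parallel across the cluster
    val lr = new LogisticRegression()
      .setLabelCol("churned")
      .setFeaturesCol("features")

    // Chain feature transformation and model fitting into one Pipeline
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(data)
    model.transform(data).select("churned", "prediction").show(5)

    spark.stop()
  }
}
```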
Spark Structured Streaming is the API, introduced in Spark 2.0, that can combine and react in real time to both static data and streaming data coming from engines like Apache Kafka or Amazon Kinesis. The data is treated as a continually growing table, so many of the abstractions and patterns used in batch processing (in Spark SQL) can be reused here. The API also handles fault tolerance and late-data semantics, which reduces the complexity of building streaming applications.
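As an illustration (the Kafka broker address and topic name are placeholders, and the Kafka connector package is assumed to be on the classpath), a streaming word count looks almost exactly like its batch counterpart, because the stream is just an unbounded table:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-word-count")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of messages from a Kafka topic
    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // The usual DataFrame operations (explode, groupBy, count) apply unchanged
    val counts = lines
      .select(explode(split($"line", " ")).alias("word"))
      .groupBy("word")
      .count()

    // Continuously print the updated counts to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```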
Finally, GraphX is the library used for manipulating graphs, such as social networks, network topologies, and graphs of routes and connection points. Besides allowing the user to perform graph-parallel computations, it also offers standard graph algorithms for analysis, connections, and traversals, such as PageRank, Connected Components, and Triangle Counting.
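To give an idea of the API (GraphX is exposed through Spark's Scala/RDD interface; the tiny social network below is made up), this sketch builds a small graph and runs PageRank on it:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graphx-example")
      .getOrCreate()
    val sc = spark.sparkContext

    // A tiny social network: vertices are users, edges are "follows" relationships
    val users: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol"), (4L, "dave")))
    val follows: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(4L, 1L, 1)))

    val graph = Graph(users, follows)

    // Run PageRank until it converges within the given tolerance
    val ranks = graph.pageRank(0.0001).vertices

    // Join the ranks back to the user names and print them, highest rank first
    users.join(ranks)
      .sortBy(_._2._2, ascending = false)
      .collect()
      .foreach { case (_, (name, rank)) => println(f"$name%-6s $rank%.4f") }

    spark.stop()
  }
}
```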
I hope that with this introduction you now have a better sense of what Spark is and how it works. In the next post of the series, I will talk about the three main data structures in Spark: RDDs, Datasets, and DataFrames, their characteristics, and when to use each one of them.