You have probably heard terms like Big Data, Hadoop, MapReduce, and Spark, but maybe you don’t know exactly what they mean and how they interact. I will start this post by clarifying some of these terms.
The term big data started being used around the 1990s to describe the growing number and size of available data sets, which led to challenges in how to deal with this new data. Other definitions described big data using three Vs: Volume, Variety and Velocity:
– Volume: The quantity of generated and stored data. There’s no clear-cut line here, but big data is usually measured in terabytes or petabytes (and this threshold keeps growing, so what constituted big data ten years ago is not big data anymore).
– Variety: The type and nature of data. Besides the traditional data stored in relational databases, big data started to deal with semi-structured and unstructured data, such as text, images, videos and logs.
– Velocity: The speed at which the data is generated and processed. Compared to small data, big data is produced more continuously, often being available in real time (streaming).
Over time, big data became a catch-all term and has fallen a little out of use, but the problem of dealing with those huge datasets is still present. One of the biggest challenges in dealing with big data is the use of parallel computing tools and distributed systems, and that is where we introduce our next term: Hadoop.
When we think about scale, one of the first things that comes to mind is Google’s search engine and its capacity to index and search the entire internet at lightning speed. Around the 2000s, Google faced exactly this problem: neither traditional relational databases nor imperative styles of programming could handle the scale at which Google wanted to operate, so they developed three solutions that were the basis for Hadoop: the Google File System (GFS), MapReduce (MR) and Bigtable.
The Google File System was a solution to provide a fault-tolerant, distributed filesystem across many cheap commodity servers in a cluster farm (instead of one huge, specialized machine), whereas Bigtable offered scalable storage of structured data on top of GFS. MapReduce introduced a new parallel programming paradigm, based on functional programming, for large-scale processing of data distributed over GFS and Bigtable.
In this approach, MapReduce sends the computation code (map and reduce functions) to where the data resides, favoring data locality and cluster rack affinity. The workers in the cluster aggregate and reduce the intermediate computations, and the final output of the reduce function is written to a distributed storage where it is accessible to your application. This approach significantly reduced network traffic and kept most of the input/output local to disk rather than distributing it over the network. Most of this work was proprietary, but the main ideas were expressed in those three papers.
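To make the map and reduce phases concrete, here is a minimal single-machine sketch of a word count, the classic MapReduce example. This is plain Python, not the actual Hadoop API: the `map_fn`, `shuffle` and `reduce_fn` names are illustrative, and in a real cluster the shuffle step and the distribution of work across machines are handled by the framework.

```python
from collections import defaultdict

# Map phase: each worker turns its chunk of input into (key, value) pairs.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group intermediate values by key
# (in a real cluster, the framework does this across the network).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values into a final result.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data is big", "data is data"]
intermediate = [pair for line in lines for pair in map_fn(line)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```

The key property is that `map_fn` runs independently on each chunk of input and `reduce_fn` runs independently on each key, which is what makes the paradigm easy to parallelize across many machines.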
Those papers spurred innovative ideas around the open-source community, especially at Yahoo!, which was facing similar challenges of scale for its search engine. Yahoo! then developed the Hadoop Distributed File System (HDFS), which in 2006 was donated to the Apache Software Foundation and became part of the Apache Hadoop framework of related modules: Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN.
In essence, HDFS is the distributed file system where the data resides (similar to the Google File System), YARN is the resource manager used to manage the cluster (RAM, processing, etc.), and MapReduce is the framework used to process and transform the data itself.
Apache Hadoop achieved widespread adoption outside Yahoo!, creating a large open-source community and two open source-based commercial companies (Cloudera and Hortonworks). However, Hadoop had a few shortcomings.
First, it was hard to manage and administer, with cumbersome operational complexity. Second, its general batch-processing MapReduce API was verbose and required a lot of boilerplate setup code. Third, for jobs consisting of many pairs of MapReduce tasks, each pair’s intermediate results were written to the local disk before the subsequent stage of the job (as expressed in the figure below), so this repeated disk I/O could make large jobs run for hours or even days.
Finally, Hadoop MapReduce was developed for large-scale batch-processing jobs but didn’t combine well with other workloads such as machine learning, streaming, or interactive SQL-like queries.
To work around those issues, the community developed several distinct systems that interacted with Hadoop, such as Apache Hive, Apache Storm, Apache Impala, Apache Giraph, Apache Drill, and Pig, each with its own APIs and cluster configurations, adding to Hadoop’s already significant operational complexity and further steepening the learning curve for developers.
In 2009, researchers at UC Berkeley who had previously worked on Hadoop MapReduce took on the challenge of making Hadoop simpler and faster, starting a project called Spark. They considered MapReduce inefficient for interactive or iterative computing jobs and a complex framework to learn, so from the start they designed Spark to be simpler, faster, and easier to use.
The central idea of the Spark project was to take ideas from Hadoop MapReduce and enhance them: make the system highly fault tolerant and more parallel, support in-memory storage for intermediate results (instead of saving everything to disk), and offer easy, composable APIs in multiple languages to support other workloads in a unified manner. Performance-wise, Spark was a success from the start: an early paper in 2010 showed that it was 10 to 20 times faster than Hadoop MapReduce for certain jobs, whereas today it’s many orders of magnitude faster.
By 2013 Spark had already gained widespread use and some of its original creators donated the Spark project to the Apache Software Foundation and started a company called Databricks. Databricks and the open-source community worked to release Apache Spark 1.0 in May 2014, followed by Spark 2.0 in July 2016 and, more recently, the 3.0 version in June 2020.
In the next post of this series, I will talk more about what Spark is and what its main components are.