Distributed Big Data Frameworks
The “processing frameworks” are one of the most essential components of a Big Data systems. There are three categories of such frameworks namely: Batch-only frameworks (Hadoop), Stream-only frameworks (Storm, Samza), and Hybrid frameworks (Spark, Hive and Flink). In this lecture, we will introduce them and cover one of the major Big Data frameworks, Apache Spark. We will cover Spark fundamentals and the model of “Resilient Distributed Datasets (RDDs)” that are used in Spark to implement in-memory batch computation.