Processing frameworks are among the most essential components of a Big Data system. There are three categories of such frameworks: batch-only frameworks (Hadoop), stream-only frameworks (Storm, Samza), and hybrid frameworks (Spark, Hive, and Flink). In this lecture, we will introduce them and cover one of the major Big Data frameworks, Apache Spark. We will cover Spark fundamentals and the model of “Resilient Distributed Datasets (RDDs)” that Spark uses to implement in-memory batch computation. Furthermore, essential practical techniques will be introduced, such as the Hadoop Distributed File System (HDFS) for data resiliency, the “lineage” property of Directed Acyclic Graphs (DAGs) for computation resiliency, and the Catalyst optimizer for query optimization.
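For a first taste of what these terms mean in practice, here is a minimal sketch (assuming a local PySpark installation; the application name and variable names are illustrative): it builds an RDD, chains two lazy transformations, and prints the lineage that Spark records so it can recompute lost partitions instead of replicating the data.

```python
from pyspark import SparkContext

# A minimal sketch, assuming a local PySpark installation.
sc = SparkContext("local[*]", "rdd-lineage-demo")

# Build a base RDD and apply two lazy transformations; nothing runs yet.
numbers = sc.parallelize(range(1, 11))          # base RDD, held in memory
squares = numbers.map(lambda x: x * x)          # transformation 1
evens = squares.filter(lambda x: x % 2 == 0)    # transformation 2

# The chain of transformations forms a DAG ("lineage"): if a partition is
# lost, Spark re-derives it from its parents rather than from replicas.
print(evens.toDebugString().decode("utf-8"))

# An action triggers execution of the whole lineage in one pass.
print(evens.collect())  # [4, 16, 36, 64, 100]

sc.stop()
```

Note that `collect()` is the only line that triggers computation; everything above it merely extends the DAG, which is what makes in-memory batch pipelines both fast and recoverable.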
Please download the materials from the following link.