In this post, we will cover an overview of Spark and MapReduce, the key differences between them, and some example use cases of each.
We will also discuss how Spark became so much more efficient at data processing than MapReduce.
Overview
MapReduce
MapReduce is a programming model and processing engine for processing and generating large data sets with a parallel, distributed algorithm on a cluster of computers.
MapReduce is composed of several components, including:
- JobTracker — The master node that manages all jobs and resources in a cluster
- TaskTrackers — Agents deployed to each machine in the cluster to run the map and reduce tasks
- JobHistoryServer — A component that tracks completed jobs; it is typically deployed as a separate daemon or together with the JobTracker
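Before diving into a full Hadoop job, the map-and-reduce idea itself can be illustrated on a plain local collection. The short Scala sketch below is only a conceptual illustration of the two phases (it is not Hadoop code), and the sample lines are made up for the example.

// Conceptual illustration of the map/reduce model on a local Scala collection.
// This is NOT Hadoop code; it only mimics the map and reduce phases.
val lines = Seq("to be or not to be", "to be is to do")

// "Map" phase: emit a (word, 1) pair for every word
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// "Shuffle + reduce" phase: group pairs by word and sum the counts
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts)  // e.g. Map(be -> 3, to -> 4, or -> 1, not -> 1, is -> 1, do -> 1)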
Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Apache Spark is a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics, with APIs available in Scala, Python, R, and Java.
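As a quick taste of those higher-level tools, the snippet below shows how Spark SQL might be used from the Scala shell of a recent Spark version (where a SparkSession named spark is predefined); the file people.json and its columns are assumed purely for illustration.

// Hypothetical Spark SQL example: people.json and its columns are assumed to exist
val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()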
Now that you have a basic overview of Spark and MapReduce, let's look at the main differences between the two.
Difference Between Spark & MapReduce
Spark keeps intermediate data in memory, whereas MapReduce writes it to disk between stages. Hadoop relies on replication to achieve fault tolerance, whereas Spark uses a different data storage model, the resilient distributed dataset (RDD), which guarantees fault tolerance in a clever way (by recomputing lost partitions from their lineage) and thereby minimizes network I/O.
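A small sketch of that in-memory model: in Spark, a dataset can be explicitly cached so that repeated passes over it read from memory instead of going back to disk, and any lost partitions are recomputed from the RDD's lineage rather than restored from replicas. The file path below is just a placeholder.

// Sketch: caching an RDD so that repeated actions reuse the in-memory copy
val logs = sc.textFile("hdfs:///data/logs.txt")      // placeholder path
val errors = logs.filter(line => line.contains("ERROR")).cache()

errors.count()                                       // first action reads from disk and caches the result
errors.filter(_.contains("timeout")).count()         // later actions reuse the cached partitions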
Spark’s Major Use Cases Over MapReduce
- Iterative Algorithms in Machine Learning (see the sketch after this list)
- Interactive Data Mining and Data Processing
- Data warehousing: Spark SQL is an Apache Hive-compatible query engine that can run queries up to 100x faster than Hive
- Stream processing: log processing and fraud detection in live streams for alerts, aggregates, and analysis
- Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful because they are easy and fast to process
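To see why iterative workloads benefit, the hedged sketch below caches a dataset once and then makes several passes over it, which is the access pattern of algorithms such as gradient descent; the file name, update rule, and numbers are invented for illustration only.

// Toy iterative job: the data is cached once, and each iteration
// reuses the in-memory copy instead of re-reading it from HDFS.
val points = sc.textFile("points.txt")               // placeholder input: one number per line
  .map(_.toDouble)
  .cache()

var estimate = 0.0
for (i <- 1 to 10) {
  // each pass is a full aggregation over the cached data
  val correction = points.map(p => p - estimate).mean()
  estimate += correction * 0.5                       // toy update step
}
println(s"final estimate: $estimate")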
Below are some use cases and scenarios that illustrate the benefits and advantages of Spark over MapReduce.
The scenario below is solved with both MapReduce and Spark, which makes it clear why one might opt for Spark instead of writing long MapReduce programs.
Scenario 1: A simple word count example in MapReduce and Spark
First, the MapReduce version, followed by the equivalent Spark code.
MapReduce
Step 1: Create an input directory in HDFS for the text file that is to be processed
hadoop fs -mkdir -p /user/$USER/input
Step 2: Copy the text file from the local file system to HDFS
hadoop fs -copyFromLocal sample.txt input
Step 3: Create the WordCount program (WordCount.java)
package wc;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new WordCount(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);

        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJobName("WordCount");
        job.setJarByClass(WordCount.class);

        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Mapper: emits a (word, 1) pair for every token in each input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
Step 4: Compile the program, build the jar, and run the job
javac -classpath `hadoop classpath` -d . WordCount.java
jar cf wordcount.jar wc/*.class
hadoop jar wordcount.jar wc.WordCount input output
Step 5: Check the output (the tail of the first reducer partition)
hadoop fs -tail output/part-r-00000 | tail > sample-tail.out
Spark
Step 1: Open up the spark-shell (Scala or Python)
Step 2: In the Scala shell, run:
val rdd1 = sc.textFile("sample.txt")
val rdd2 = rdd1.flatMap(line => line.split(" "))
val rdd3 = rdd2.map(word => (word, 1))
val rdd4 = rdd3.reduceByKey((v1, v2) => v1 + v2)
rdd4.collect()
rdd4.saveAsTextFile("/user/input/wordcount")
VERDICT
Roughly 60 lines of MapReduce code for a simple word count program are reduced to fewer than 10 lines of Spark code. It shows the efficiency of Spark and the ease of writing code with it.
You will get to know all of this and deep-dive into each concept related to Hadoop Development & Apache Spark once you enroll in our Hadoop Developer Using Apache Spark training.
Another question that might come to your mind: what exactly do you get when you enroll?
We are glad to tell you that:
Things you will get!!
- Live Instructor-led Online Interactive Sessions
- FREE unlimited retakes for the next 1 year
- FREE on-job support for the next 1 year
- Training Material (Presentation + Step-by-Step Hands-on Guide)
- Recording of Live Interactive Sessions for Lifetime Access
- 100% Money Back Guarantee (if you attend the sessions, practice, and don't get results, we'll give a full refund; check our Refund Policy)
Reference & Related
Next Task for you:
Did you get a chance to download our FREE guide on Big Data Hadoop Development? If not, get it now by clicking on the link below.